
This page contains a representative selection of my publications, categorized and arranged in reverse chronological order. Use the "»" button to reveal the abstract of each publication, and "«" to hide it again (requires JavaScript).

Books

L. Busoniu, L. Tamas (editors), Handling Uncertainty and Networked Structure in Robot Control, Springer, Series Studies in Systems, Decision and Control. February 2016, ISBN 978-3-319-26327-4. »
Abstract:
This book focuses on two challenges posed in robot control by the increasing adoption of robots in the everyday human environment: uncertainty and networked communication. Part I of the book describes learning control to address environmental uncertainty. Part II discusses state estimation, active sensing, and complex scenario perception to tackle sensing uncertainty. Part III completes the book with control of networked robots and multi-robot teams.
Each chapter features in-depth technical coverage and case studies highlighting the applicability of the techniques, with real robots or in simulation. Platforms include mobile ground, aerial, and underwater robots, as well as humanoid robots and robot arms.
The text gathers contributions from academic and industry experts, and offers a valuable resource for researchers or graduate students in robot control and perception. It also benefits researchers in related areas, such as computer vision, nonlinear and learning control, and multi-agent systems.
See the book's website at http://rocon.utcluj.ro/roboticsbook/ for additional information about the book and how to obtain it, as well as downloadable material. «
L. Busoniu, R. Babuska, B. De Schutter, D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, Series Automation and Control Engineering. April 2010, 280 pages, ISBN 978-1439821084. »
Abstract:
Reinforcement learning (RL) can optimally solve decision and control problems involving complex dynamic systems, without requiring a mathematical model of the system. If a model is available, dynamic programming (DP), the model-based counterpart of RL, can be used. RL and DP are applicable in a variety of disciplines, including automatic control, artificial intelligence, economics, and medicine. Recent years have seen a surge of interest in RL and DP using compact, approximate representations of the solution, which enable algorithms to scale up to realistic problems.
This book provides an in-depth introduction to RL and DP with function approximators. A concise description of classical RL and DP (Chapter 2) builds the foundation for the remainder of the book. This is followed by an extensive review of the state-of-the-art in RL and DP with approximation, which combines algorithm development with theoretical guarantees, illustrative numerical examples, and insightful comparisons (Chapter 3). Each of the final three chapters (4 to 6) is dedicated to a representative algorithm from the three major classes of methods: value iteration, policy iteration, and policy search. The features and performance of these algorithms are highlighted in extensive experimental studies on a range of control applications.
For graduate students and others new to the field, this book offers a thorough introduction to both the basics and emerging methods. And for those researchers and practitioners working in the fields of optimal and adaptive control, machine learning, artificial intelligence, and operations research, this resource offers a combination of practical algorithms, theoretical analysis, and comprehensive examples that they will be able to adapt and apply to their own work.
Access the book's website at http://rlbook.busoniu.net for additional information about the book and how to obtain it, as well as free access to a sample chapter and other downloadable material, including errata. «

Journal papers

T. Santejudean, S. Ungur, R. Herzal, C. Morarescu, V. Varma, L. Busoniu, Globally convergent path-aware optimization with mobile robots. Nonlinear Analysis - Hybrid Systems, vol. 55, pages 101546, 2025. »
Abstract: Consider a mobile robot that must navigate as quickly as possible to the global maxima of a function (e.g. density of seabed litter, pollutant concentration, wireless signal strength) defined over its operating area. This objective function is initially unknown and is assumed to be Lipschitz continuous. The limited velocity of the robot restricts the next samples to neighboring positions, and to avoid wasting time and energy, the robot’s path must be adapted as new information becomes available. The paper proposes two methods that use an upper bound on the objective to iteratively change the position targeted by the robot as new samples are acquired. The first method is FTW, which Turns When the best value seen so far of the objective Function is larger than the bound of the current target position. The second is FTWD, an extension of FTW that takes into account the Distance to the target. Convergence guarantees are provided for both methods, and a convergence rate is proven to characterize how fast the FTW suboptimality decreases as the number of samples grows. In a numerical study, FTWD greatly improves performance compared to FTW, outperforms two representative source-seeking baselines, and obtains results similar to a much more computationally intensive method that does not guarantee convergence. The relationship between FTW and FTWD is also confirmed in real-robot experiments, where a TurtleBot3 seeks the darkest point on a 2D grayscale map.
Online at ScienceDirect.
«
B. Yousuf, R. Herzal, Zs. Lendek, L. Busoniu, Multi-agent active multi-target search with intermittent measurements. Control Engineering Practice, vol. 153, pages 106094, 2024. »
Abstract: Consider a multi-agent system that must find an unknown number of static targets at unknown locations as quickly as possible. To estimate the number and positions of targets from noisy and sometimes missing measurements, we use a customized particle-based probability hypothesis density filter. Novel methods are introduced that select waypoints for the agents in a decoupled manner from taking measurements, which allows optimizing over waypoints arbitrarily far in the environment while taking as many measurements as necessary along the way. Optimization involves control cost, target refinement, and exploration of the environment. Measurements are taken either periodically, or only when they are expected to improve target detection, in an event-triggered manner. All this is done in 2D and 3D environments, for a single agent as well as for multiple homogeneous or heterogeneous agents, leading to a comprehensive framework for (Multi-Agent) Active target Search with Intermittent measurements – (MA)ASI. In simulations and real-life experiments involving a Parrot Mambo drone and a TurtleBot3 ground robot, the novel framework works better than baselines including lawnmowers, mutual-information-based methods, active search methods, and our earlier exploration-based techniques.
Online at Elsevier.
«
B. Yousuf, Zs. Lendek, L. Busoniu, Exploration-Based Planning for Multiple-Target Search with Real-Drone Results. Sensors, vol. 24, no. 9, pages 2868, 2024. »
Abstract: Consider a drone that aims to find an unknown number of static targets at unknown positions as quickly as possible. A multi-target particle filter uses imperfect measurements of the target positions to update an intensity function that represents the expected number of targets. We propose a novel receding-horizon planner that selects the next position of the drone by maximizing an objective that combines exploration and target refinement. Confidently localized targets are saved and removed from consideration along with their future measurements. A controller with an obstacle-avoidance component is used to reach the desired waypoints. We demonstrate the performance of our approach through a series of simulations as well as via a real-robot experiment in which a Parrot Mambo drone searches from a constant altitude for targets located on the floor. Target measurements are obtained on-board the drone using segmentation in the camera image, while planning is done off-board. The sensor model is adapted to the application. Both in the simulations and in the experiments, the novel framework works better than the lawnmower and active-search baselines.
Online at MDPI.
«
S. Benahmed, R. Postoyan, M. Granzotto, L. Busoniu, J. Daafouz, D. Nesic, Stability analysis of optimal control problems with time-dependent costs. Automatica, vol. 157, pages 111272, 2023. »
Abstract: We present stability conditions for deterministic time-varying nonlinear discrete-time systems whose inputs aim to minimize an infinite-horizon time-dependent cost. Global asymptotic and exponential stability properties for general attractors are established. This work covers and generalizes the related results on discounted optimal control problems to more general systems and cost functions.
Online at ScienceDirect.
«
I. Lal, I.-C. Morarescu, J. Daafouz, L. Busoniu, Optimistic planning for control of hybrid-input nonlinear systems. Automatica, vol. 154, pages 111097, 2023. »
Abstract: We propose two branch-and-bound, optimistic planning algorithms for discrete-time nonlinear optimal control problems in which there is a continuous and a discrete action (input). The dynamics and rewards (negative costs) must be Lipschitz but can otherwise be general, as long as certain boundedness conditions are satisfied by the continuous action, reward, and Lipschitz constant of the dynamics. We start by investigating the structure of the space of hybrid-input sequences. Based on this structure, we propose for the first algorithm an optimistic selection rule that picks for refinement (branching) the subset with the largest upper bound on the value. At the price of a higher budget, the second method reduces the reliance on the Lipschitz constant, by refining all sets that are potentially optimistic. This effectively means that the Lipschitz constant is automatically optimized. The way to select the largest-impact action along which to refine the sets is the same for both algorithms, and still depends on the Lipschitz constant. We provide convergence rate guarantees for both methods, which link the computational budget to the near-optimality of the action sequences returned, in a way that depends on a problem complexity measure. We also give empirical results for a nonlinear problem, where the algorithms are applied in receding horizon, and depending on the budget either one or the other algorithm works better.
Online at ScienceDirect.
«
I. Lal, I.-C. Morarescu, J. Daafouz, L. Busoniu, Near-optimal control of nonlinear systems with hybrid inputs and dwell-time constraints. IEEE Control Systems Letters, vol. 7, pages 2455–2460, 2023. »
Abstract: We propose two new optimistic planning algorithms for nonlinear hybrid-input systems, in which the input has both a continuous and a discrete component, and the discrete component must respect a dwell-time constraint. Both algorithms select sets of input sequences for refinement at each step, along with a continuous or discrete step to refine (split). The dwell-time constraint means that the discrete splits must keep the discrete mode constant if the required dwell-time is not yet reached. Convergence rate guarantees are provided for both algorithms, which show the dependency between the near-optimality of the sequence returned and the computational budget. The rates depend on a novel complexity measure of the dwell-time constrained problem. We present simulation results for two problems, an adaptive-quantization networked control system and a model for the COVID pandemic.
Online at IEEE.
«
T. Santejudean, L. Busoniu, Online learning control for path-aware global optimization with nonlinear mobile robots. Control Engineering Practice, vol. 126, 2022. »
Abstract: Consider a robot with nonlinear dynamics that must quickly find a global optimum of an objective function defined over its operating area, e.g., a chemical concentration, physical measurement, quantity of material etc. The function is initially unknown and must be learned online from samples acquired in a single trajectory. Applying classical optimization methods in this scenario would be highly suboptimal, since they would place the next sample arbitrarily far, without taking into account robot motion constraints, and would not revise the path based on new information accumulated along it. To address these limitations, a novel algorithm called Path-Aware Optimistic Optimization (OOPA) is proposed. The decision of which robot action to apply is formulated as an optimal control problem in which the rewards are refinements of the upper bound on the objective, weighted by bound and objective values to focus the search around optima. OOPA is evaluated in extensive simulations where it is compared to path-unaware optimization baselines, and in a real experiment in which a ROBOTIS TurtleBot3 successfully searches for the lowest grayscale location on a 2D surface.
Online at Elsevier.
«
M. Rosynski, L. Busoniu, A Simulator and First Reinforcement Learning Results for Underwater Mapping. Sensors, vol. 22, no. 14, 2022. »
Abstract: Underwater mapping with mobile robots has a wide range of applications, and good models are lacking for key parts of the problem, such as sensor behavior. The specific focus here is the huge environmental problem of underwater litter, in the context of the Horizon 2020 SeaClear project, where a team of robots is being developed to map and collect such litter. No reinforcement-learning solution to underwater mapping has been proposed thus far, even though the framework is well suited for robot control in unknown settings. As a key contribution, this paper therefore makes a first attempt to apply deep reinforcement learning (DRL) to this problem by exploiting two state-of-the-art algorithms and making a number of mapping-specific improvements. Since DRL often requires millions of samples to work, a fast simulator is required, and another key contribution is to develop such a simulator from scratch for mapping seafloor objects with an underwater vehicle possessing a sonar-like sensor. Extensive numerical experiments on a range of algorithm variants show that the best DRL method collects litter significantly faster than a baseline lawn mower trajectory.
Online at MDPI.
«
M. Granzotto, R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz, Stable near-optimal control of nonlinear switched discrete-time systems: an optimistic planning-based approach. IEEE Transactions on Automatic Control, vol. 67, no. 5, pages 2298–2313, 2022. »
Abstract: Originating in the artificial intelligence literature, optimistic planning (OP) is an algorithm that generates near-optimal control inputs for generic nonlinear discrete-time systems whose input set is finite. This technique is therefore relevant for the near-optimal control of nonlinear switched systems for which the switching signal is the control, and no continuous input is present. However, OP exhibits several limitations, which prevent its desired application in a standard control engineering context, as it requires for instance that the stage cost takes values in [0, 1], an unnatural prerequisite, and that the cost function be discounted. In this paper, we modify OP to overcome these limitations, and we call the new algorithm OPmin. We then analyze OPmin under general stabilizability and detectability assumptions on the system and the stage cost. New near-optimality and performance guarantees for OPmin are derived, which have major advantages compared to those originally given for OP. We also prove that a system whose inputs are generated by OPmin in a receding-horizon fashion exhibits stability properties. As a result, OPmin provides a new tool for the near-optimal, stable control of nonlinear switched discrete-time systems for generic cost functions.
Online at Elsevier.
«
Z. Nagy, Zs. Lendek, L. Busoniu, TS fuzzy observer-based controller design for a class of discrete-time nonlinear systems. IEEE Transactions on Fuzzy Systems, vol. 30, no. 2, pages 555–566, 2022. »
Abstract: This paper presents an observer-based control design approach for a class of nonlinear discrete-time systems. The model nonlinearities are handled in two ways: 1) a Takagi-Sugeno fuzzy representation is used for nonlinearities that depend on measured states, and 2) nonlinearities that depend on unmeasured states are kept in their original form and handled using a slope-bound condition. The observer-based controller design conditions are given as linear matrix inequalities. The approach we propose significantly improves results in the literature by providing less restrictive design conditions. These improvements are illustrated in a detailed analytical and numerical comparison on a synthetic example, while a pendulum-on-a-cart example shows that the approach works both in simulation and in real-time experiments.
Online at IEEE.
«
G. Feng, T.-M. Guerra, L. Busoniu, A.-T. Nguyen, S. Mohammad, Robust observer-based tracking control under actuator constraints for power-assisted wheelchairs. Control Engineering Practice, vol. 109, 2021. »
Abstract: Power-assisted wheelchairs (PWA) are an important growing market. The goal is to provide electrical assistive kits that are able to cope with a large family of disabled people and to equip a large variety of wheelchairs. This work is carried out in collaboration with Autonomad Mobility, a company that designs the hardware and sells Power-Assistance kits for wheelchairs. Several crucial issues arise, e.g. how to assist any Person with Reduced Mobility (PRM)? How to detect the user's intentions? How to cope with the lack of system information due to excessive sensor costs? Effectively, due to the variety of wheelchairs and the different unknown PRM characteristics (mass, height, force, etc.) and pathologies, it is unrealistic to provide a solution using a precise modeling of the whole system including the wheelchair, the PRM and the ground conditions. However, proposing a safe and secure solution is obviously mandatory for this application. In particular, an on-the-market solution should also be smooth and friendly for the end-user. Estimation of the human torques is a first key point to achieve such a solution, which has already been studied in our previous works. This paper exploits these estimation results to propose a robust control law for PWA systems under saturation constraints. These constraints are unavoidable due to regulations on maximum authorized speed. From a control point of view, the problem amounts to output feedback control with partially unknown references (desired speed, direction), unknown parameters (wheelchair and PRM masses, available force, ground characteristics) and input constraints. Finding an effective solution for this constrained output feedback tracking control problem remains open.
In this paper, we propose a two-step control design using a quasi Linear Parameter Varying (q-LPV) formulation to solve this challenging control problem, i.e., first design an observer for state and unknown input estimation, and second propose a robust control scheme under parameter variations and input saturations. The control procedure is reformulated as convex optimization problems involving linear matrix inequality (LMI) constraints that can be efficiently solved with standard numerical solvers. Simulations and real-time experiments are provided to show the effectiveness of the solution.
Online at Elsevier.
«
M. Granzotto, R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz, Finite-horizon discounted optimal control: stability and performance. IEEE Transactions on Automatic Control, vol. 66, no. 2, pages 550–565, 2021. »
Abstract: Motivated by (approximate) dynamic programming and model predictive control problems, we analyse the stability of deterministic nonlinear discrete-time systems whose inputs minimize a discounted finite-horizon cost. We assume that the system satisfies stabilizability and detectability properties with respect to the stage cost. Then, a Lyapunov function for the closed-loop system is constructed and a uniform semiglobal stability property is ensured, where the adjustable parameters are both the discount factor and the horizon length, which corresponds to the number of iterations for dynamic programming algorithms like value iteration. Stronger stability properties such as global exponential stability are also provided by strengthening the initial assumptions. We give bounds on the discount factor and the horizon length under which stability holds. In addition, we provide new relationships between the optimal value functions of the discounted, undiscounted, infinite-horizon and finite-horizon costs respectively, which appear to be very different from those available in the literature.
Online at IEEEXplore.
«
Z. Nagy, Zs. Lendek, L. Busoniu, Observer design for a class of nonlinear systems with nonscalar-input nonlinear consequents. IEEE Control Systems Letters, vol. 5, no. 3, 2020. »
Abstract: This letter presents a discrete-time Takagi-Sugeno fuzzy observer design approach for a class of nonlinear systems. Instead of including all the nonlinear terms in the membership functions, some of them are kept as nonlinear consequents, and they need to fulfill a global Lipschitz condition. The form considered permits nonlinear consequents that depend on nonscalar inputs. The design conditions are defined in terms of linear matrix inequalities, and they are less restrictive than previous conditions from the literature. Two numerical examples highlight the advantages obtained.
Online at IEEE.
«
L. Busoniu, V. Varma, J. Loheac, A. Codrean, O. Stefan, C. Morarescu, S. Lasaulce, Learning control for transmission and navigation with a mobile robot under unknown communication rates. Control Engineering Practice, vol. 100, 2020. »
Abstract: In tasks such as surveying or monitoring remote regions, an autonomous robot must move while transmitting data over a wireless network with unknown, position-dependent transmission rates. For such a robot, this paper considers the problem of transmitting a data buffer in minimum time, while possibly also navigating towards a goal position. Two approaches are proposed, each consisting of a machine-learning component that estimates the rate function from samples; and of an optimal-control component that moves the robot given the current rate function estimate. Simple obstacle avoidance is performed for the case without a goal position. In extensive simulations, these methods achieve competitive performance compared to known-rate and unknown-rate baselines. A real indoor experiment is provided in which a Parrot AR.Drone 2 successfully learns to transmit the buffer.
Online at ScienceDirect.
«
D. Mezei, L. Tamas, L. Busoniu, Sorting Objects from a Conveyor Belt Using POMDPs with Multiple-Object Observations and Information-Gain Rewards. Sensors, vol. 20, no. 9, 2020. »
Abstract: We consider a robot that must sort objects transported by a conveyor belt into different classes. Multiple observations must be performed before taking a decision on the class of each object, because the imperfect sensing sometimes detects the incorrect object class. The objective is to sort the sequence of objects in a minimal number of observation and decision steps. We describe this task in the framework of partially observable Markov decision processes, and we propose a reward function that explicitly takes into account the information gain of the viewpoint selection actions applied. The DESPOT algorithm is applied to solve the problem, automatically obtaining a sequence of observation viewpoints and class decision actions. Observations are made either only for the object on the first position of the conveyor belt or for multiple adjacent positions at once. The performance of the single- and multiple-position variants is compared, and the impact of including the information gain is analyzed. Real-life experiments with a Baxter robot and an industrial conveyor belt are provided.
Online at MDPI.
«
C. Morarescu, V. Varma, L. Busoniu, S. Lasaulce, Space-time budget allocation policy design for viral marketing. Nonlinear Analysis - Hybrid Systems, vol. 37, 2020. »
Abstract: We formally address the problem of opinion dynamics when the agents of a social network (e.g., consumers) are not only influenced by their neighbors but also by an external influential entity referred to as a marketer. The influential entity tries to sway the overall opinion as close as possible to a desired opinion by using a specific influence budget. We assume that the exogenous influences of the entity happen during discrete-time advertising campaigns; consequently, the overall closed-loop opinion dynamics becomes a linear-impulsive (hybrid) one. The main technical issue addressed is finding how the marketer should allocate its budget over time (through marketing campaigns) and over space (among the agents) such that the agents' opinion is as close as possible to the desired opinion. Our main results show that the marketer has to prioritize certain agents over others based on their initial condition, their influence power in the social graph and the size of the cluster they belong to. The corresponding space-time allocation problem is formulated and solved for several special cases of practical interest. Valuable insights can be extracted from our analysis. For instance, for most cases, we prove that the marketer has an interest in investing most of its budget at the beginning of the process and that budget should be shared among agents according to the famous water-filling allocation rule. Numerical examples illustrate the analysis.
Online at ScienceDirect.
«
L. Busoniu, J. Ben Rejeb, I. Lal, I.-C. Morarescu, J. Daafouz, Optimistic minimax search for noncooperative switched control with or without dwell time. Automatica, vol. 112, 2020. »
Abstract: We consider adversarial problems in which two agents control two switching signals, the first agent aiming to maximize a discounted sum of rewards, and the second aiming to minimize it. Both signals may be subject to constraints on the dwell time after a switch. We search the tree of possible mode sequences with an algorithm called optimistic minimax search with dwell time (OMSd), showing that it obtains a solution close to the minimax-optimal one, and we characterize the rate at which the suboptimality goes to zero. The analysis is driven by a novel measure of problem complexity, and it is first given in the general dwell-time case, after which it is specialized to the unconstrained case. We exemplify the framework for networked control systems where the minimizer signal is a discrete time delay on the control channel, and we provide extensive simulations and a real-time experiment for nonlinear systems of this type.
Online at ScienceDirect.
«
G. Feng, L. Busoniu, T.M. Guerra, S. Mohammad, Data-Efficient Reinforcement Learning for Energy Optimization of Power-Assisted Wheelchairs. IEEE Transactions on Industrial Electronics, vol. 66, no. 12, pages 9734–9744, 2019. »
Abstract: The objective of this paper is to develop a method for assisting users to push power-assisted wheelchairs (PAWs) in such a way that the electrical energy consumption over a predefined distance-to-go is optimal, while at the same time bringing users to a desired fatigue level. This assistive task is formulated as an optimal control problem and solved by Feng et al. using the model-free approach called gradient of partially observable Markov decision processes (GPOMDP). To increase the data efficiency of the model-free framework, we here propose to use policy learning by weighting exploration with the returns (PoWER) with 25 control parameters. Moreover, we provide a new near-optimality analysis of the finite-horizon fuzzy Q-iteration, which derives a model-based baseline solution to verify numerically the near-optimality of the presented model-free approaches. Simulation results show that the PoWER algorithm with the new parameterization converges to a near-optimal solution within 200 trials and possesses the adaptability to cope with changes of the human fatigue dynamics. Finally, 24 experimental trials are carried out on the PAW system, with fatigue feedback provided by the user via a joystick. The performance tends to increase gradually after learning. The results obtained demonstrate the effectiveness and the feasibility of PoWER in our application.
Online at IEEEXplore.
«
L. Busoniu, T. de Bruin, D. Tolic, J. Kober, I. Palunko, Reinforcement Learning for Control: Performance, Stability, and Deep Approximators. Annual Reviews in Control, vol. 46, pages 8–28, 2018. »
Abstract: Reinforcement learning (RL) offers powerful algorithms to search for optimal controllers of systems with nonlinear, possibly stochastic dynamics that are unknown or highly uncertain. This review mainly covers artificial-intelligence approaches to RL, from the viewpoint of the control engineer. We explain how approximate representations of the solution make RL feasible for problems with continuous states and control actions. Stability is a central concern in control, and we argue that while the control-theoretic RL subfield called adaptive dynamic programming is dedicated to it, stability of RL largely remains an open question. We also cover in detail the case where deep neural networks are used for approximation, leading to the field of deep RL, which has shown great success in recent years. With the control practitioner in mind, we outline opportunities and pitfalls of deep RL, and we close the survey with an outlook that, among other things, points out some avenues for bridging the gap between control and artificial-intelligence RL techniques.
Online at ScienceDirect.
«
L. Busoniu, E. Pall, R. Munos, Continuous-action planning for discounted infinite-horizon nonlinear optimal control with Lipschitz values. Automatica, vol. 92, pages 100–108, 2018. »
Abstract: We consider discrete-time, infinite-horizon optimal control problems with discounted rewards. The value function must be Lipschitz continuous over action (input) sequences, the actions are in a scalar interval, while the dynamics and rewards can be nonlinear/nonquadratic. Exploiting ideas from artificial intelligence, we propose two optimistic planning methods that perform an adaptive-horizon search over the infinite-dimensional space of action sequences. The first method optimistically refines regions with the largest upper bound on the optimal value, using the Lipschitz constant to find the bounds. The second method simultaneously refines all potentially optimistic regions, without explicitly using the bounds. Our analysis proves convergence rates to the global infinite-horizon optimum for both algorithms, as a function of computation invested and of a measure of problem complexity. It turns out that the second, simultaneous algorithm works nearly as well as the first, despite not needing to know the (usually difficult to find) Lipschitz constant. We provide simulations showing the algorithms are useful in practice, compare them with value iteration and model predictive control, and give a real-time example.
Online at ScienceDirect.
«
K. Mathe, L. Busoniu, R. Munos, B. De Schutter, Optimistic planning with an adaptive number of action switches for near-optimal nonlinear control. Engineering Applications of Artificial Intelligence, vol. 67, 2018. »
Abstract: We consider infinite-horizon optimal control of nonlinear systems where the control actions are discrete, and focus on optimistic planning algorithms from artificial intelligence, which can handle general nonlinear systems with nonquadratic costs. With the main goal of reducing computations, we introduce two such algorithms that only search for constrained action sequences. The constraint prevents the sequences from switching between different actions more than a limited number of times. We call the first method optimistic switch-limited planning (OSP), and develop analysis showing that its fixed number of switches S leads to polynomial complexity in the search horizon, in contrast to the exponential complexity of the existing OP algorithm for deterministic systems; and to a correspondingly faster convergence towards optimality. Since tuning S is difficult, we introduce an adaptive variant called OASP that automatically adjusts S so as to limit computations while ensuring that near-optimal solutions keep being explored. OSP and OASP are analytically evaluated in representative special cases, and numerically illustrated in simulations of a rotational pendulum. To show that the algorithms also work in challenging applications, OSP is used to control the pendulum in real time, while OASP is applied for trajectory control of a simulated quadrotor.
Online at ScienceDirect.
«
R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz, Stability Analysis of Discrete-Time Infinite-Horizon Optimal Control with Discounted Cost. IEEE Transactions on Automatic Control, vol. 62, no. 6, pages 2736–2749, 2017. »
Abstract: We analyse the stability of general nonlinear discrete-time systems controlled by an optimal sequence of inputs that minimizes an infinite-horizon discounted cost. First, assumptions related to the controllability of the system and its detectability with respect to the stage cost are made. Uniform semiglobal and practical stability of the closed-loop system is then established, where the adjustable parameter is the discount factor. Stronger stability properties are thereupon guaranteed by gradually strengthening the assumptions. Next, we show that the Lyapunov function used to prove stability is continuous under additional conditions, implying that stability has a certain amount of nominal robustness. The presented approach is flexible and we show that robust stability can still be guaranteed when the sequence of inputs applied to the system is no longer optimal but near-optimal. We also analyse stability for cost functions in which the importance of the stage cost increases with time, opposite to discounting. Finally, we exploit stability to derive new relationships between the optimal value functions of the discounted and undiscounted problems, when the latter is well-defined.
Online at IEEEXplore.
«
L. Busoniu, J. Daafouz, M.-C. Bragagnolo, C. Morarescu, Planning for optimal control and performance certification in nonlinear systems with controlled or uncontrolled switches. Automatica, vol. 78, pages 297–308, 2017. »
Abstract: We consider three problems for discrete-time switched systems with autonomous, general nonlinear modes. The first is optimal control of the switching rule so as to optimize the infinite-horizon discounted cost. The second and third problems occur when the switching rule is uncontrolled, and we seek either the worst-case cost when the rule is unknown, or respectively the expected cost when the rule is stochastic. We use optimistic planning (OP) algorithms that can solve general optimal control with discrete inputs such as switches. We extend the analysis of OP to provide certification (upper and lower) bounds on the optimal, worst-case, or expected costs, as well as to design switching sequences that achieve these bounds in the deterministic case. In this case, since a minimum dwell time between switching instants is often required, we introduce a new OP variant to handle this constraint, and analyze its convergence rate. We provide consistency and closed-loop performance guarantees for the sequences designed, and illustrate that the approach works well in simulations.
Online at ScienceDirect.
«
L. Busoniu, A. Daniels, R. Babuska, Online Learning for Optimistic Planning. Engineering Applications of Artificial Intelligence, vol. 55, pages 60–72, 2016. »
Abstract: Markov decision processes are a powerful framework for nonlinear, possibly stochastic optimal control. We consider two existing optimistic planning algorithms to solve them, which originate in artificial intelligence. These algorithms have provable near-optimal performance when the actions and possible stochastic next-states are discrete, but they wastefully discard the planning data after each step. We therefore introduce a method to learn online, from this data, the upper bounds that are used to guide the planning process. Five different approximators for the upper bounds are proposed, one of which is specifically adapted to planning, and the other four coming from the standard toolbox of function approximation. Our analysis characterizes the influence of the approximation error on the performance, and reveals that for small errors, learning-based planning performs better. In detailed experimental studies, learning leads to improved performance with all five representations, and a local variant of support vector machines provides a good compromise between performance and computation.
Online at ScienceDirect.
«
L. Busoniu, R. Postoyan, J. Daafouz, Near-optimal Strategies for Nonlinear and Uncertain Networked Control Systems. IEEE Transactions on Automatic Control, vol. 61, no. 8, pages 2124–2139, 2016. »
Abstract: We consider problems where a controller communicates with a general nonlinear plant via a network, and must optimize a performance index. The system is modeled in discrete time and may be affected by a class of stochastic uncertainties that can take finitely many values. Admissible inputs are constrained to belong to a finite set. Exploiting some optimistic planning algorithms from the artificial intelligence field, we propose two control strategies that take into account the communication constraints induced by the use of the network. Both strategies send in a single packet long-horizon solutions, such as sequences of inputs. Our analysis characterizes the relationship between computation, near-optimality, and transmission intervals. In particular, the first strategy imposes at each transmission a desired near-optimality, which we show is related to an imposed transmission period; for this setting, we analyze the required computation. The second strategy has a fixed computation budget, and within this constraint it adapts the next transmission instant to the last state measurement, leading to a self-triggered policy. For this case, we guarantee long transmission intervals. Examples and simulation experiments are provided throughout the paper.
Online at IEEEXplore.
«
K. Mathe, L. Busoniu, Vision and Control for UAVs: A Survey of General Methods and of Inexpensive Platforms for Infrastructure Inspection. Sensors, vol. 15, no. 7, pages 14887–14916, 2015. »
Abstract: Unmanned aerial vehicles (UAVs) have gained significant attention in recent years. Low-cost platforms using inexpensive sensor payloads have been shown to provide satisfactory flight and navigation capabilities. In this report, we survey vision and control methods that can be applied to low-cost UAVs, and we list some popular inexpensive platforms and application fields where they are useful. We also highlight the sensor suites used where this information is available. We overview, among others, feature detection and tracking, optical flow and visual servoing, low-level stabilization and high-level planning methods. We then list popular low-cost UAVs, selecting mainly quadrotors. We discuss applications, restricting our focus to the field of infrastructure inspection. Finally, as an example, we formulate two use-cases for railway inspection, a less explored application field, and illustrate the usage of the vision and control techniques reviewed by selecting appropriate ones to tackle these use-cases. To select vision methods, we run a thorough set of experimental evaluations.
Online at MDPI.
«
L. Busoniu, C. Morarescu, Topology-Preserving Flocking of Nonlinear Agents Using Optimistic Planning. Control Theory and Technology, vol. 13, no. 1, pages 70–81, 2015. »
Abstract: We consider the generalized flocking problem in multiagent systems, where the agents must drive a subset of their state variables to common values, while communication is constrained by a proximity relationship in terms of another subset of variables. We build a flocking method for general nonlinear agent dynamics, by using at each agent a near-optimal control technique from artificial intelligence called optimistic planning. By defining the rewards to be optimized in a well-chosen way, the preservation of the interconnection topology is guaranteed, under a controllability assumption. We also give a practical variant of the algorithm that does not require knowing the details of this assumption, and show that it works well in experiments on nonlinear agents.
Online at CTT.
«
L. Busoniu, C. Morarescu, Consensus for Black-Box Nonlinear Agents Using Optimistic Optimization. Automatica, vol. 50, no. 4, pages 1201–1208, 2014. »
Abstract: An important problem in multiagent systems is consensus, which requires the agents to agree on certain controlled variables of interest. We focus on the challenge of dealing in a generic way with nonlinear agent dynamics, represented as a black box with unknown mathematical form. Our approach designs a reference behavior with a classical consensus method. The main novelty is using optimistic optimization (OO) to find controls that closely follow the reference behavior. The first advantage of OO is that it only needs to sample the black-box model of the agent, and so achieves our goal of handling unknown nonlinearities. Secondly, a tight relationship is guaranteed between computation invested and closeness to the reference behavior. Our main results exploit these properties to prove practical consensus. An analysis of representative examples builds additional insight and shows that in some nontrivial problems the optimization is easy to solve by OO. Simulations on these examples accompany the analysis.
Online at ScienceDirect.
«
I. Grondman, L. Busoniu, G. Lopes, R. Babuska, A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, vol. 42, no. 6, pages 1291–1307, 2012. »
Abstract: Policy-gradient-based actor-critic algorithms are amongst the most popular algorithms in the reinforcement learning framework. Their advantage of being able to search for optimal policies using low-variance gradient estimates has made them useful in several real-life applications, such as robotics, power control, and finance. Although general surveys on reinforcement learning techniques already exist, no survey is specifically dedicated to actor-critic algorithms. This paper, therefore, describes the state of the art of actor-critic algorithms, with a focus on methods that can work in an online setting and use function approximation in order to deal with continuous state and action spaces. After starting with a discussion on the concepts of reinforcement learning and the origins of actor-critic algorithms, this paper describes the workings of the natural gradient, which has made its way into many actor-critic algorithms over the past few years. A review of several standard and natural actor-critic algorithms is given, and the paper concludes with an overview of application areas and a discussion on open issues.
Online at IEEEXplore.
«
I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, E. Schuitema, Efficient Model Learning Methods for Actor-Critic Control. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, vol. 42, no. 3, pages 591–602, 2012. »
Abstract: We propose two new actor–critic algorithms for reinforcement learning. Both algorithms use local linear regression (LLR) to learn approximations of the functions involved. A crucial feature of the algorithms is that they also learn a process model, and this, in combination with LLR, provides an efficient policy update for faster learning. The first algorithm uses a novel model-based update rule for the actor parameters. The second algorithm does not use an explicit actor but learns a reference model which represents a desired behavior, from which desired control actions can be calculated using the inverse of the learned process model. The two novel methods and a standard actor–critic algorithm are applied to the pendulum swing-up problem, in which the novel methods achieve faster learning than the standard algorithm.
Online at IEEEXplore.
«
S. Adam, L. Busoniu, R. Babuska, Experience Replay for Real-Time Reinforcement Learning Control. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, vol. 42, no. 2, pages 201–212, 2012. »
Abstract: Reinforcement learning (RL) algorithms can automatically learn optimal control strategies for nonlinear, possibly stochastic systems. A promising approach for RL control is experience replay (ER), which quickly learns from a limited amount of data by repeatedly presenting these data to an underlying RL algorithm. Despite its benefits, ER RL has been studied only sporadically in the literature, and its applications have largely been confined to simulated systems. Therefore, in this paper we evaluate ER RL on real-time control experiments involving a pendulum swing-up problem and the vision-based control of a goalkeeper robot. These real-time experiments are complemented by simulation studies and comparisons with traditional RL. As a preliminary, we develop a general ER framework that can be combined with essentially any incremental RL technique, and instantiate this framework for the approximate Q-learning and SARSA algorithms. The successful real-time learning results presented here are highly encouraging for the applicability of ER RL in practice.
Online at IEEEXplore.
«
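The experience replay idea described in the abstract above, repeatedly presenting stored transitions to an underlying RL algorithm, can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a tabular Q-learning update instead of the approximate one, a hypothetical five-state chain task, and a simplified schedule in which data is collected first with a random policy and then replayed. All names (`step`, `collect`, `replay_q_learning`) are illustrative.

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical chain task: states 0..4, goal at 4
ACTIONS = (0, 1)           # 0: move left, 1: move right

def step(s, a):
    """Deterministic chain dynamics; reward 1 only when the goal is reached."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    done = (s2 == GOAL)
    return s2, (1.0 if done else 0.0), done

def collect(rng, episodes=50, max_steps=50):
    """Gather transitions with a uniformly random exploration policy."""
    data = []
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)
            s2, r, done = step(s, a)
            data.append((s, a, r, s2, done))
            if done:
                break
            s = s2
    return data

def replay_q_learning(data, rng, passes=50, alpha=0.5, gamma=0.9):
    """Replay the stored transitions several times through Q-learning."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(passes):
        batch = data[:]
        rng.shuffle(batch)  # present the same data in a new order each pass
        for s, a, r, s2, done in batch:
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
    return q

rng = random.Random(0)
q = replay_q_learning(collect(rng), rng)
# Greedy action in each non-terminal state
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
```

Reusing each transition many times is what lets the value estimates propagate back from the goal with only a small amount of collected data.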
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Cross-Entropy Optimization of Control Policies with Adaptive Basis Functions. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, vol. 41, no. 1, pages 196–209, 2011. »
Abstract: This paper introduces an algorithm for direct search of control policies in continuous-state, discrete-action Markov decision processes. The algorithm looks for the best closed-loop policy that can be represented using a given number of basis functions (BFs), where a discrete action is assigned to each BF. The type of the BFs and their number are specified in advance and determine the complexity of the representation. Considerable flexibility is achieved by optimizing the locations and shapes of the BFs, together with the action assignments. The optimization is carried out with the cross-entropy method and evaluates the policies by their empirical return from a representative set of initial states. The return for each representative state is estimated using Monte Carlo simulations. The resulting algorithm for cross-entropy policy search with adaptive BFs is extensively evaluated in problems with two to six state variables, for which it reliably obtains good policies with only a small number of BFs. In these experiments, cross-entropy policy search requires vastly fewer BFs than value-function techniques with equidistant BFs, and outperforms policy search with a competing optimization algorithm called DIRECT.
Online at IEEEXplore.
«
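As a rough illustration of the cross-entropy method mentioned in the abstract above, the following sketch optimizes a continuous parameter vector by repeatedly sampling candidates, keeping the best ones, and refitting the sampling distribution to them. It is a simplification under stated assumptions: only Gaussian-distributed continuous parameters are handled (the paper also optimizes discrete action assignments), and the `score` function is a toy stand-in for the empirical return of a policy. The function name and default settings are hypothetical.

```python
import random
import statistics

def cross_entropy_search(score, dim, n_samples=50, n_elite=10, iters=30, seed=0):
    """Maximize score(params) with a Gaussian cross-entropy method."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [5.0] * dim
    for _ in range(iters):
        # Sample candidate parameter vectors from the current distribution
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Keep the best-scoring candidates (the elite set)
        samples.sort(key=score, reverse=True)
        elite = samples[:n_elite]
        # Refit the distribution to the elite set, one dimension at a time
        mean = [statistics.fmean(col) for col in zip(*elite)]
        std = [statistics.pstdev(col) + 1e-12 for col in zip(*elite)]
    return mean

# Toy score with its maximum at (3, -1); in policy search this would be
# replaced by the estimated return of the policy encoded by the parameters.
def toy_score(p):
    return -(p[0] - 3.0) ** 2 - (p[1] + 1.0) ** 2

best = cross_entropy_search(toy_score, dim=2)
```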
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Approximate Dynamic Programming with a Fuzzy Parametrization. Automatica, vol. 46, no. 5, pages 804–814, 2010. »
Abstract: Dynamic programming (DP) is a powerful paradigm for general, nonlinear optimal control. Computing exact DP solutions is in general only possible when the process states and the control actions take values in a small discrete set. In practice, it is necessary to approximate the solutions. Therefore, we propose an algorithm for approximate DP that relies on a fuzzy partition of the state space, and on a discretization of the action space. This fuzzy Q-iteration algorithm works for deterministic processes, under the discounted return criterion. We prove that fuzzy Q-iteration asymptotically converges to a solution that lies within a bound of the optimal solution. A bound on the suboptimality of the solution obtained in a finite number of iterations is also derived. Under continuity assumptions on the dynamics and on the reward function, we show that fuzzy Q-iteration is consistent, i.e., that it asymptotically obtains the optimal solution as the approximation accuracy increases. These properties hold both when the parameters of the approximator are updated in a synchronous fashion, and when they are updated asynchronously. The asynchronous algorithm is proven to converge at least as fast as the synchronous one. The performance of fuzzy Q-iteration is illustrated in a two-link manipulator control problem.
Online at ScienceDirect.
«
L. Busoniu, R. Babuska, B. De Schutter, A Comprehensive Survey of Multi-Agent Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics — Part C: Applications and Reviews, vol. 38, no. 2, pages 156–172, 2008. Recipient of the 2009 Andrew P. Sage Award for the best paper published annually in the IEEE Transactions on Systems, Man and Cybernetics. »
Abstract: Multi-agent systems are rapidly finding applications in a variety of domains, including robotics, distributed control, telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents must instead discover a solution on their own, using learning. A significant part of the research on multi-agent learning concerns reinforcement learning techniques. This paper provides a comprehensive survey of multi-agent reinforcement learning (MARL). A central issue in the field is the formal statement of the multi-agent learning goal. Different viewpoints on this issue have led to the proposal of many different goals, among which two focal points can be distinguished: stability of the agents' learning dynamics, and adaptation to the changing behavior of the other agents. The MARL algorithms described in the literature aim—either explicitly or implicitly—at one of these two goals or at a combination of both, in a fully cooperative, fully competitive, or more general setting. A representative selection of these algorithms is discussed in detail in this paper, together with the specific issues that arise in each category. Additionally, the benefits and challenges of MARL are described along with some of the problem domains where MARL techniques have been applied. Finally, an outlook for the field is provided.
Keywords: multi-agent systems, reinforcement learning, game theory, distributed control.
This is an extended and revised version of the ICARCV-06 MARL paper.

Online at IEEEXplore.
«

Contributions to books

M. Bragagnolo, C. Morarescu, L. Busoniu, P. Riedinger, Decentralized Formation Control in Fleets of Nonholonomic Robots with a Clustered Pattern. In Handling Uncertainty and Networked Structure in Robot Control, L. Busoniu, L. Tamas, Editors, pages 313–333. Springer, 2016. »
Abstract: In this work we consider a fleet of non-holonomic robots that has to realize a formation in a decentralized and collaborative manner. The fleet is clustered due to communication or energy-saving constraints. We assume that each robot continuously measures its relative distance to other robots belonging to the same cluster. Due to this, the robots communicate on a directed connected graph within each cluster. On top of this, in each cluster there exists one robot called leader that receives information from other leaders at discrete instants. In order to realize the formation we propose a two-step strategy. First, the robots compute reference trajectories using a linear consensus protocol. Second, a classical tracking control strategy is used to follow the references. Overall, formation realization is obtained. Numerical simulations with robot teams illustrate the effectiveness of this approach.
Online at SpringerLink.
«
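The first step of the two-step strategy in the abstract above relies on a linear consensus protocol. A minimal discrete-time sketch of such a protocol is given below. It assumes scalar agent states, a hypothetical undirected four-agent ring graph, and a fixed step size `eps`; the chapter itself works with directed graphs, clusters, and nonholonomic robots, none of which are modeled here.

```python
def consensus_step(x, neighbors, eps):
    """One synchronous update: x_i <- x_i + eps * sum_j (x_j - x_i)."""
    return [xi + eps * sum(x[j] - xi for j in neighbors[i])
            for i, xi in enumerate(x)]

def run_consensus(x0, neighbors, eps=0.2, steps=200):
    """Iterate the consensus update from the initial states x0."""
    x = list(x0)
    for _ in range(steps):
        x = consensus_step(x, neighbors, eps)
    return x

# Hypothetical 4-agent ring: each agent exchanges states with two neighbors
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
x_final = run_consensus([0.0, 2.0, 4.0, 10.0], neighbors)
```

On an undirected connected graph with a small enough step size, the states converge to the average of the initial values, here 4.0. The resulting trajectories would then serve as references for a tracking controller.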
E. Pall, L. Tamas, L. Busoniu, Vision-Based Quadcopter Navigation in Structured Environments. In Handling Uncertainty and Networked Structure in Robot Control, L. Busoniu, L. Tamas, Editors, pages 265–290. Springer, 2016. »
Abstract: Quadcopters are small-sized aerial vehicles with four fixed-pitch propellers. These robots have great potential since they are inexpensive with affordable hardware, and with appropriate software solutions they can accomplish assignments autonomously. They could perform daily tasks in the future, such as package deliveries, inspections, and rescue missions. In this chapter, after an extensive introduction to object recognition and tracking, we present an approach for vision-based autonomous flying of an unmanned quadcopter in various structured environments, such as hallway-like scenes. The desired flight direction is obtained visually, based on perspective clues, in particular the vanishing point. This point is the intersection of parallel lines viewed in perspective, and is sought on the front camera image. For a stable guidance the position of the vanishing point is filtered with different types of probabilistic filters, such as linear Kalman filter, extended Kalman filter, unscented Kalman filter and particle filter. These are compared in terms of the tracking error and also for computational time. A switching control method is implemented. Each of the modes focuses on controlling only one state variable at a time and the objective is to center the vanishing point on the image. The selected filtering and control methods are tested successfully, both in simulation and in real indoor and outdoor environments.
Online at SpringerLink.
«
L. Busoniu, R. Munos, R. Babuska, A Review of Optimistic Planning in Markov Decision Processes. In Reinforcement Learning and Adaptive Dynamic Programming for Feedback Control, F. Lewis, D. Liu, Editors, series Computational Intelligence, pages 494–516. Wiley, 2012. »
Abstract: We review a class of online planning algorithms for deterministic and stochastic optimal control problems, modeled as Markov decision processes. At each discrete time step, these algorithms maximize the predicted value of planning policies from the current state, and apply the first action of the best policy found. An overall receding-horizon algorithm results, which can also be seen as a type of model-predictive control. The space of planning policies is explored optimistically, focusing on areas with largest upper bounds on the value – or upper confidence bounds, in the stochastic case. The resulting optimistic planning framework integrates several types of optimism previously used in planning, optimization, and reinforcement learning, in order to obtain several intuitive algorithms with good performance guarantees. We describe in detail three recent such algorithms, outline the theoretical guarantees on their performance, and illustrate their behavior in a numerical example.
Online at Wiley Online Library.
«
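The optimistic exploration described in the abstract above can be sketched for the deterministic case as follows. This is an illustration rather than the chapter's algorithms: it assumes rewards in [0, 1], a discount factor `gamma`, and a user-supplied model `step(x, u)`. Each leaf of the search tree gets an upper bound equal to its discounted return so far plus `gamma**depth / (1 - gamma)`, and the leaf with the largest bound is expanded next. All names and the toy problem are hypothetical.

```python
import heapq
import itertools

def optimistic_planning(x0, actions, step, gamma=0.9, budget=100):
    """Return (first action, discounted return) of the best sequence found.

    step(x, u) must return (next_state, reward) with reward in [0, 1].
    """
    tie = itertools.count()  # tie-breaker so states are never compared
    # Heap entries: (-upper_bound, tie, depth, state, return_so_far, first_action)
    heap = [(-1.0 / (1.0 - gamma), next(tie), 0, x0, 0.0, None)]
    best_return, best_first = -1.0, None
    for _ in range(budget):
        # Expand the leaf with the largest upper bound (optimism)
        _, _, depth, x, ret, first = heapq.heappop(heap)
        for u in actions:
            x2, r = step(x, u)
            ret2 = ret + gamma ** depth * r
            first2 = u if first is None else first
            bound2 = ret2 + gamma ** (depth + 1) / (1.0 - gamma)
            heapq.heappush(heap, (-bound2, next(tie), depth + 1, x2, ret2, first2))
            if ret2 > best_return:
                best_return, best_first = ret2, first2
    return best_first, best_return

# Toy problem: integer position clipped to [-5, 5], reward 1 whenever x >= 3
def toy_step(x, u):
    x2 = max(-5, min(5, x + u))
    return x2, (1.0 if x2 >= 3 else 0.0)

first_action, value = optimistic_planning(0, (-1, 1), toy_step)
```

In a receding-horizon loop, `first_action` would be applied to the system and planning repeated from the next state.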
L. Busoniu, A. Lazaric, M. Ghavamzadeh, R. Munos, R. Babuska, B. De Schutter, Least-Squares Methods for Policy Iteration. In Reinforcement Learning: State of the Art, M. Wiering, M. van Otterlo, Editors, series Adaptation, Learning, and Optimization, no. 12, pages 75–109. Springer, 2012. »
Abstract: Approximate reinforcement learning deals with the essential problem of applying reinforcement learning in large and continuous state-action spaces, by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core, policy evaluation component of policy iteration, called: least-squares temporal difference, least-squares policy evaluation, and Bellman residual minimization. We introduce these techniques starting from their general mathematical principles and detailing them down to fully specified algorithms. We pay attention to online variants of policy iteration, and provide a numerical example highlighting the behavior of representative offline and online methods. For the policy evaluation component as well as for the overall resulting approximate policy iteration, we provide guarantees on the performance obtained asymptotically, as the number of samples processed and iterations executed grows to infinity. We also provide finite-sample results, which apply when a finite number of samples and iterations are considered. Finally, we outline several extensions and improvements to the techniques and methods reviewed.
Online at SpringerLink.
«
L. Busoniu, B. De Schutter, R. Babuska, Approximate Dynamic Programming and Reinforcement Learning. In Interactive Collaborative Information Systems, R. Babuska, F.C.A. Groen, Editors, series Studies in Computational Intelligence, no. 281, pages 3–44. Springer, 2010. »
Abstract: DP and RL can be used to address problems from a variety of fields, including automatic control, artificial intelligence, operations research, and economy. Many problems in these fields are described by continuous variables, whereas DP and RL can find exact solutions only in the discrete case. Therefore, approximation is essential in practical DP and RL. This chapter provides an in-depth review of the literature on approximate DP and RL in large or continuous-space, infinite-horizon problems. Value iteration, policy iteration, and policy search approaches are presented in turn. Model-based (DP) as well as online and batch model-free (RL) algorithms are discussed. We review theoretical guarantees on the approximate solutions produced by these algorithms. Numerical examples illustrate the behavior of several representative algorithms in practice. Techniques to automatically derive value function approximators are discussed, and a comparison between value iteration, policy iteration, and policy search is provided. The chapter closes with a discussion of open issues and promising research directions in approximate DP and RL.
Online at SpringerLink.
«
L. Busoniu, R. Babuska, B. De Schutter, Multi-Agent Reinforcement Learning: An Overview. In Innovations in Multi-Agent Systems and Applications, D. Srinivasan, L. Jain, Editors, series Studies in Computational Intelligence, no. 310, pages 183–221. Springer, 2010. »
Abstract: Multi-agent systems can be used to address problems in a variety of domains, including robotics, distributed control, telecommunications, and economics. The complexity of many tasks arising in these domains makes them difficult to solve with preprogrammed agent behaviors. The agents must instead discover a solution on their own, using learning. A significant part of the research on multi-agent learning concerns reinforcement learning techniques. This chapter reviews a representative selection of MARL algorithms for fully cooperative, fully competitive, and more general (neither cooperative nor competitive) tasks. The benefits and challenges of MARL are described. A central challenge in the field is the formal statement of a multi-agent learning goal; this chapter reviews the learning goals proposed in the literature. The problem domains where MARL techniques have been applied are briefly discussed. Several MARL algorithms are applied to an illustrative example involving the coordinated transportation of an object by two cooperative robots. In an outlook for the MARL field, a set of important open issues are identified, and promising research directions to address these issues are outlined.
The code used in the example is available for download, as part of the MARL toolbox in the Repository section.
This is an extended and revised version of the SMC 2008 paper above.

Online at SpringerLink.
«
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Continuous-State Reinforcement Learning with Fuzzy Approximation. In Adaptive Agents and Multi-Agent Systems III, K. Tuyls, A. Nowe, Z. Guessoum, D. Kudenko, Editors, series Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol. 4865, pages 27–43. Springer, 2008. »
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. There exist several convergent and consistent RL algorithms which have been intensively studied. In their original form, these algorithms require that the environment states and agent actions take values in a relatively small discrete set. Fuzzy representations for approximate, model-free RL have been proposed in the literature for the more difficult case where the state-action space is continuous. In this work, we propose a fuzzy approximation architecture similar to those previously used for Q-learning, but we combine it with the model-based Q-value iteration algorithm. We prove that the resulting algorithm converges. We also give a modified, asynchronous variant of the algorithm that converges at least as fast as the original version. An illustrative simulation example is provided.
This is an extended and revised version of the ALAMAS-07 paper.
Online at SpringerLink.
«

Conference papers

T. Alinei-Poiana, D. Rete, D. Martinovici, V.-M. Maer, L. Busoniu, A BlueROV2-based platform for underwater mapping experiments. In Proceedings of the 15th IFAC Conference on Control Applications in Marine Systems, Robotics and Vehicles (CAMS-24), pages 470–475, Blacksburg, US, 3–5 September 2024. »
Abstract: We propose a low-cost laboratory platform for development and validation of underwater mapping techniques, using the BlueROV2 Remotely Operated Vehicle (ROV). Both the ROV and the objects to be mapped are placed in a pool that is imaged via an overhead camera. In our prototype mapping application, the ROV's pose is found using extended Kalman filtering on measurements from the overhead camera, inertial, and pressure sensors; while objects are detected with a deep neural network in the ROV camera stream. Validation experiments are performed for pose estimation, detection, and mapping. The litter detection dataset and code are made publicly available.
Online at ScienceDirect.
«
X. Zhao, C. Wang, J. Xu, L. Busoniu, Multi-Agent Collision Avoidance Based on DRL and ORCA. In Proceedings of the 43rd Chinese Control Conference (CCC-24), pages 6016–6021, Kunming, China, 28–31 July 2024. »
Abstract: This paper proposes a distributed multi-agent collision avoidance model in dynamic and complex environments based on ORCA and DRL. The main work combines data-driven reinforcement learning with model-based knowledge by integrating imitation learning into reinforcement learning and by designing more effective observations and reward functions. Four strategies are compared, and results demonstrate that our method exhibits superior capabilities.
Online at IEEEXplore.
«
V.-M. Maer, Zs. Lendek, S. Pirje, D. Tolic, A. Djuras, V. Prkacin, I. Palunko, L. Busoniu, Two-channel extended Kalman filtering with intermittent measurements. In Proceedings of the 2024 IEEE American Control Conference (ACC-24), Toronto, Canada, 10–12 July 2024. »
Abstract: We consider two nonlinear state estimation problems in a setting where an extended Kalman filter receives measurements from two sets of sensors via two channels (2C). In the stochastic-2C problem, the channels drop measurements stochastically, whereas in 2C scheduling, the estimator chooses when to read each channel. In the first problem, we generalize linear-case 2C analysis to obtain - for a given pair of channel arrival rates - boundedness conditions for the trace of the error covariance, as well as a worst-case upper bound. For scheduling, an optimization problem is solved to find arrival rates that balance low channel usage with low trace bounds, and channels are read deterministically with the expected periods corresponding to these arrival rates. We validate both solutions in simulations for linear and nonlinear dynamics; as well as in a real experiment with an underwater robot whose position is being intermittently found in a UAV camera image.
Online at IEEEXplore.
«
E. Pop, I. Lal, L. Busoniu, Real-Time Simultaneous Optimistic Planning for Hybrid-Input Nonlinear Optimal Control. In Proceedings of the 2024 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR-24), Cluj-Napoca, Romania, 16–18 May 2024. »
Abstract: Simultaneous Optimistic Planning for Hybrid-Input Systems (SOPHIS) is a powerful method for the near-optimal control of nonlinear systems with hybrid - continuous and discrete - inputs, which works by iteratively splitting sets of input sequences. The generality of SOPHIS however comes at high computational costs that are often untenable in real-time control, especially for fast unstable systems. We introduce two modifications that make SOPHIS more suitable for real-time control: running it on a separate machine, over multiple sampling periods, while applying several inputs to the system during this time; and parallelizing the algorithm by splitting several sets simultaneously across multiple threads. Experiments investigate two parallelization schemes, the impact of thread count on the execution time, and the influence of the prediction horizon and budget; the latter on a real-life fast unstable system, a rotary inverted pendulum. In the experiments, the discrete input controls the quantization accuracy of the control action sent to the system.
Online at IEEEXplore.
«
M. Rosynski, A. Pop, L. Busoniu, Active search and coverage using point-cloud reinforcement learning. In Proceedings of the 27th International Conference on System Theory, Control and Computing (ICSTCC-23), pages 289–296, Timisoara, Romania, 11–13 October 2023. »
Abstract: We consider a problem in which the trajectory of a mobile 3D sensor must be optimized so that certain objects are both found in the overall scene and covered by the point cloud, as fast as possible. This problem is called target search and coverage, and the paper provides an end-to-end deep reinforcement learning (RL) solution to it. The deep neural network combines four components: deep hierarchical feature learning occurs in the first stage, followed by multi-head transformers in the second, max-pooling and merging with bypassed information to preserve spatial relationships in the third, and a distributional dueling network in the last stage. To evaluate the method, a simulator is developed where cylinders must be found by a Kinect sensor. A network architecture study shows that deep hierarchical feature learning works for RL and that by using farthest point sampling (FPS) we can reduce the number of points and achieve not only a reduction of the network size but also better results. We also show that multi-head attention for point clouds helps the agent learn faster but converges to the same outcome. Finally, we compare RL using the best network with a greedy baseline that maximizes immediate rewards and requires for that purpose an oracle that predicts the next observation. We find that RL achieves significantly better and more robust results than the greedy strategy.
Online at IEEEXplore.
«
B. Yousuf, Zs. Lendek, L. Busoniu, Multi-Agent Exploration-Based Search for an Unknown Number of Targets. In Proceedings of the 22nd IFAC World Congress (IFAC-23), Yokohama, Japan, 9–14 July 2023. »
Abstract: This paper presents an active sensor fusion technique for multiple mobile agents (robots) to detect an unknown number of static targets at unknown positions. To process and fuse sensor measurements from the agents, we use a random finite set formulation with an iterated-corrector probability hypothesis density filter. Our main contribution is to introduce two different multi-agent planners to quickly find the targets. The planners make greedy decisions for the next state of each agent by maximizing an objective function consisting of target refinement and exploration components. We demonstrate the performance of our approach through a series of simulations using homogeneous and heterogeneous agents. The results show that our framework works better than a lawnmower baseline, and that a centralized version of the planner works best.
Online at ScienceDirect.
«
B. Yousuf, Zs. Lendek, L. Busoniu, Exploration-Based Search for an Unknown Number of Targets using a UAV. In Proceedings of the 6th IFAC Conference on Intelligent Control and Automation Sciences (ICONS-22), Cluj-Napoca, Romania, 13–15 July 2022. »
Abstract: We consider a scenario in which a UAV must locate an unknown number of targets at unknown locations in a 2D environment. A random finite set formulation with a particle filter is used to estimate the target locations from noisy measurements that may miss targets. A novel planning algorithm selects a next UAV state that maximizes an objective function consisting of two components: target refinement and exploration. Found targets are saved and then disregarded from measurements to focus on refining poorly seen targets. The desired next state is used as a reference point for a nonlinear tracking controller for the robot. Simulation results show that the method works better than lawnmower and mutual-information baselines.
Online at ScienceDirect.
«
M. Dragomir, V.M. Maer, L. Busoniu, The Co4AIR Marathon – A Matlab Simulated Drone Racing Competition. In Proceedings of the 2022 International Conference on Unmanned Aircraft Systems (ICUAS-22), pages 1219–1226, Dubrovnik, Croatia, 21–24 June 2022. »
Abstract: We describe a UAV competition concept in which a Parrot Mambo drone must race over a sequence of colored markers in minimum time. The competition is implemented in Matlab, using the Simulink Support Package for Parrot Minidrones, and can be organized fully in simulation, although an optional real-drone component is included. Students with either control or computer-science backgrounds are accommodated by providing baseline solution modules for the part outside their expertise. We present the competition design, a baseline solution, and our experience with the first edition, which was held in 2021, including student feedback and lessons learned.
Online at IEEEXplore.
«
T. Santejudean, L. Busoniu, V. Varma, C. Morarescu, A simple path-aware optimization method for mobile robots. In Proceedings of the 6th IFAC Symposium on Telematics Applications (TA-22), pages 1–6, Nancy, France, 15–17 June 2022. »
Abstract: We present an approach for a mobile robot to seek the global maximum of an initially unknown function defined over its operating space. The method exploits a Lipschitz assumption to define an upper bound on the function from previously seen samples, and optimistically moves towards the largest upper-bound point. This point is iteratively changed whenever new samples make it clear that it is suboptimal. In simulations, the method finds the global maxima with much less computation than an existing, much more involved technique, while keeping performance acceptable. Real-robot experiments confirm the effectiveness of the approach.
Online at ScienceDirect.
«
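The Lipschitz upper bound described in the abstract above can be illustrated with a minimal sketch. It assumes a known Lipschitz constant and a finite set of candidate points; the function names and the candidate set are hypothetical and not taken from the paper, and the path-planning part of the method is not shown.

```python
import math

def lipschitz_upper_bound(x, samples, lipschitz):
    """Upper bound on f(x) implied by the samples seen so far.

    Each sample (x_i, f_i) gives f(x) <= f_i + L * ||x - x_i||;
    the tightest of these bounds is their minimum.
    """
    return min(f + lipschitz * math.dist(x, xs) for xs, f in samples)

def most_optimistic_point(candidates, samples, lipschitz):
    """Candidate with the largest upper bound (the optimistic target)."""
    return max(candidates, key=lambda x: lipschitz_upper_bound(x, samples, lipschitz))

# Two samples of an unknown function, assumed Lipschitz constant 1.
samples = [((0.0, 0.0), 1.0), ((1.0, 0.0), 0.5)]
candidates = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0)]
target = most_optimistic_point(candidates, samples, lipschitz=1.0)
```

Here the far point (2.0, 0.0) is chosen because nothing sampled so far rules out a high value there; as new samples arrive, the bound tightens and the target may change.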
V.M. Maer, L. Tamas, L. Busoniu, Underwater robot pose estimation using acoustic methods and intermittent position measurements at the surface. In Proceedings of the 2022 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR-22), Cluj-Napoca, Romania, 19–21 May 2022. »
Abstract: Global positioning systems can provide sufficient positioning accuracy for large scale robotic tasks in open environments. However, in underwater environments, these systems cannot be directly used, and measuring the position of underwater robots becomes more difficult. In this paper we first evaluate the performance of existing pose estimation techniques for an underwater robot equipped with commonly used sensors for underwater control and pose estimation, in a simulated environment. In our case these sensors are inertial measurement units, Doppler velocity log sensors, and ultra-short baseline sensors. Secondly, for situations in which underwater estimation suffers from drift, we investigate the benefit of intermittently correcting the position using a high-precision surface-based sensor, such as regular GPS or an assisting unmanned aerial vehicle that tracks the underwater robot from above using a camera.
Online at IEEEXplore.
«
T. Santejudean, L. Busoniu, Path-aware optimistic optimization for a mobile robot. In Proceedings 60th IEEE Conference on Decision and Control (CDC-21), pages 3584–3590, Austin, US, 13–17 December 2021. »
Abstract: We consider problems in which a mobile robot samples an unknown function defined over its operating space, so as to find a global optimum of this function. The path travelled by the robot matters, since it influences energy and time requirements. We consider a branch-and-bound algorithm called deterministic optimistic optimization, and extend it to the path-aware setting, obtaining path-aware optimistic optimization (OOPA). In this new algorithm, the robot decides how to move next via an optimal control problem that maximizes the long-term impact of the robot trajectory on lowering the upper bound, weighted by bound and function values to focus the search on the optima. An online version of value iteration is used to solve an approximate version of this optimal control problem. OOPA is evaluated in extensive experiments in two dimensions, where it does better than path-unaware and local-optimization baselines.
Online at IEEEXplore.
«
I. Lal, C. Morarescu, J. Daafouz, L. Busoniu, Optimistic planning for near-optimal control of nonlinear systems with hybrid inputs. In Proceedings 60th IEEE Conference on Decision and Control (CDC-21), pages 2486–2493, Austin, US, 13–17 December 2021. »
Abstract: We propose an optimistic planning, branch-and-bound algorithm for nonlinear optimal control problems in which there is a continuous and a discrete action (input). The dynamics and rewards (negative costs) must be Lipschitz but can otherwise be general, as long as certain boundedness conditions are satisfied by the continuous action, reward, and Lipschitz constant of the dynamics. We investigate the structure of the space of hybrid-input sequences, and based on this structure we propose an optimistic selection rule for the subset with the largest upper bound on the value, and a way to select the largest-impact action for further refinement. Together, these fully define the algorithm, which we call OPHIS: optimistic planning for hybrid-input systems. A near-optimality bound is provided together with empirical results in two nonlinear problems where the algorithm is applied in receding horizon.
Online at IEEEXplore.
«
M. Granzotto, R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz, Exploiting homogeneity for the optimal control of discrete-time systems: application to value iteration. In Proceedings 60th IEEE Conference on Decision and Control (CDC-21), pages 6006–6011, Austin, US, 13–17 December 2021. »
Abstract: To investigate solutions of (near-)optimal control problems, we extend and exploit a notion of homogeneity recently proposed in the literature for discrete-time systems. Assuming the plant dynamics is homogeneous, we first derive a scaling property of its solutions along rays provided the sequence of inputs is suitably modified. We then consider homogeneous cost functions and reveal how the optimal value function scales along rays. This result can be used to construct (near-)optimal inputs on the whole state space by only solving the original problem on a given compact manifold of a smaller dimension. Compared to the related works of the literature, we impose no conditions on the homogeneity degrees. We demonstrate the strength of this new result by presenting a new approximate scheme for value iteration, which is one of the pillars of dynamic programming. The new algorithm provides guaranteed lower and upper estimates of the true value function at any iteration and has several appealing features in terms of reduced computation. A numerical case study is provided to illustrate the proposed algorithm.
Online at IEEEXplore.
«
T. Natsakis, L. Busoniu, Predicting Intention of Motion During Rehabilitation Tasks of the Upper-Extremity. In 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 6037–6040, online, 1–5 November 2021. »
Abstract: Rehabilitation promoting assistance-as-needed is considered a promising scheme of active rehabilitation, since it can promote neuroplasticity faster and thus reduce the time needed until restoration. To implement such schemes using robotic devices, it is crucial to be able to predict accurately and in real-time the intention of motion of the patient. In this study, we present an intention-of-motion model trained on healthy volunteers. The model is trained using kinematics and muscle activation time series data, and returns future predicted values for the kinematics. We also present a sensitivity analysis of the model's accuracy for different amounts of training data and varying lengths of the prediction horizon. We demonstrate that the model is able to predict reliably the kinematics of volunteers that were not involved in its training. The model is tested with three types of motion inspired by rehabilitation tasks. In all cases, the model predicts the arm kinematics with a Root Mean Square Error (RMSE) below 0.12 m. Being a non-person-specific model, it could be used to predict kinematics even for patients that are not able to perform any motion without assistance. The resulting kinematics, even if not fully representative of the specific patient, might be a better input for a robotic rehabilitator than the predefined trajectories currently in use.
Online at IEEEXplore.
«
M. Ndje, L. Bitjoka, A. Boum, D. Mbogne, L. Busoniu, J.C. Kamgang, G. Djogdom, Fast constrained nonlinear model predictive control for implementation on microcontrollers. In Proceedings 4th IFAC Conference on Embedded Systems, Computational Intelligence and Telematics in Control (CESCIT-21), pages 19–24, Valenciennes, France, 5–7 July 2021. »
Abstract: Model predictive control (MPC) is based on the systematic resolution of an online optimization problem at each time step. In practice, the computation cost is often very high, especially for the non-linear case under constraints, thus complicating the application of MPC to real-time systems. This paper proposes to improve the non-linear quadratic dynamic matrix control (NLQDMC) algorithm for MPC by solving constrained optimization problems only when necessary, and defaulting to the unconstrained solution whenever possible. The new algorithm is called fast NLQDMC (FNLQDMC) and is applied to the control of a nonlinear system comprising a converter and a DC machine, and implemented in a microcontroller board. The results obtained show that, depending on the setpoint profiles, this algorithm saves more than 64% computation of the constrained problem compared to the conventional NLQDMC, while keeping identical performance in terms of setpoint tracking and constraint satisfaction.
Online at ScienceDirect.
«
F. Gogianu, T. Berariu, M. Rosca, C. Clopath, L. Busoniu, R. Pascanu, Spectral normalisation for deep reinforcement learning: an optimisation perspective. In Proceedings International Conference on Machine Learning (ICML-21), pages 3734–3744, online, 18–24 July 2021. »
Abstract: Most of the recent deep reinforcement learning advances take an RL-centric perspective and focus on refinements of the training objective. We diverge from this view and show we can recover the performance of these developments not by changing the objective, but by regularising the value-function estimator. Constraining the Lipschitz constant of a single layer using spectral normalisation is sufficient to elevate the performance of a Categorical-DQN agent to that of a more elaborate agent on the challenging Atari domain. We conduct ablation studies to disentangle the various effects normalisation has on the learning dynamics and show that it is sufficient to modulate the parameter updates to recover most of the performance of spectral normalisation. These findings hint towards the need to also focus on the neural component and its learning dynamics to tackle the peculiarities of Deep Reinforcement Learning.
Online at PMLR.
«
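As background for the abstract above, spectral normalisation divides a layer's weight matrix by an estimate of its largest singular value, which bounds the Lipschitz constant of that linear layer by 1. The sketch below is a plain-Python illustration using power iteration; the function names are hypothetical, and it is not the training setup used in the paper.

```python
import math

def spectral_norm(weight, n_iters=50):
    """Estimate the largest singular value of a matrix (list of rows).

    Power iteration on W^T W; the uniform start vector is an assumption
    and must not be orthogonal to the top right-singular vector.
    """
    n_rows, n_cols = len(weight), len(weight[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iters):
        u = [sum(w * x for w, x in zip(row, v)) for row in weight]  # u = W v
        v = [sum(weight[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    u = [sum(w * x for w, x in zip(row, v)) for row in weight]
    return math.sqrt(sum(x * x for x in u))  # sigma = ||W v||

def spectrally_normalise(weight):
    """Return W / sigma(W), leaving an all-zero matrix unchanged."""
    sigma = spectral_norm(weight)
    if sigma == 0.0:
        return [row[:] for row in weight]
    return [[w / sigma for w in row] for row in weight]

layer = [[3.0, 0.0], [0.0, 1.0]]
normalised = spectrally_normalise(layer)
```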
M. Granzotto, R. Postoyan, D. Nesic, L. Busoniu, J. Daafouz, When to stop value iteration: stability and near-optimality versus computation. In Proceedings of the 3rd Conference on Learning for Dynamics and Control (L4DC), pages 412–424, MIT, US, 30–31 May 2021. »
Abstract: Value iteration (VI) is a ubiquitous algorithm for optimal control, planning, and reinforcement learning schemes. Under the right assumptions, VI is a vital tool to generate inputs with desirable properties for the controlled system, like optimality and Lyapunov stability. As VI usually requires an infinite number of iterations to solve general nonlinear optimal control problems, a key question is when to terminate the algorithm to produce a 'good' solution, with a measurable impact on optimality and stability guarantees. By carefully analysing VI under general stabilizability and detectability properties, we provide explicit and novel relationships of the stopping criterion's impact on near-optimality, stability and performance, thus allowing to tune these desirable properties against the induced computational cost. The considered class of stopping criteria encompasses those encountered in the control, dynamic programming and reinforcement learning literature and it allows considering new ones, which may be useful to further reduce the computational cost while endowing and satisfying stability and near-optimality properties. We therefore lay a foundation to endow machine learning schemes based on VI with stability and performance guarantees, while reducing computational complexity.
Online at PMLR.
«
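To make the stopping question in the abstract above concrete, here is a minimal sketch of value iteration with one common stopping criterion: stop when the largest change between successive value functions drops below a tolerance. It assumes a small finite, deterministic, discounted cost-minimization problem, which is far simpler than the general nonlinear setting of the paper; the data layout and names are hypothetical.

```python
def value_iteration(transitions, costs, gamma, tol, max_iters=10_000):
    """Value iteration for a finite deterministic problem.

    transitions[s][a] is the next state and costs[s][a] the stage cost.
    Returns the value estimate and the number of iterations performed.
    """
    n_states = len(transitions)
    values = [0.0] * n_states
    for iteration in range(1, max_iters + 1):
        new_values = [
            min(
                costs[s][a] + gamma * values[transitions[s][a]]
                for a in range(len(transitions[s]))
            )
            for s in range(n_states)
        ]
        change = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if change < tol:  # stopping criterion: successive iterates are close
            return values, iteration
    return values, max_iters

# State 0 is a zero-cost absorbing goal; from state 1, action 0 stays
# in place and action 1 moves to the goal, both at stage cost 1.
transitions = [[0], [1, 0]]
costs = [[0.0], [1.0, 1.0]]
values, n_iters = value_iteration(transitions, costs, gamma=0.9, tol=1e-9)
```

A looser tolerance stops earlier at the price of a less accurate value function, which is the trade-off between computation and near-optimality that the abstract refers to.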
Csanad Sandor, Szabolcs Pavel, Wieser Erik, Andreea Blaga, Peter Boda, Andrea-Orsolya Fulop, Adrian Ursache, Attila Zold, Aniko Kopacz, Botond Lazar, Karoly Szabo, Zoltan Tasnadi, Botond Trinfa, Lehel Csato, Dan Marius Tegzes, Marian Leontin Pop, Raluca Alexandra Tarziu, Mihai-Valentin Zaha, Sorin Mihai Grigorescu, Lucian Busoniu, Paula Raica, Levente Tamas, The ClujUAV student competition: A corridor navigation challenge with autonomous drones. In Proceedings 21st IFAC World Congress (IFAC-20), Online (Berlin, Germany), 12–17 July 2020. »
Abstract: We describe a novel student contest concept in which an unmanned aerial vehicle (UAV or drone) must autonomously navigate a straight corridor using feedback from camera images. The objective of the contest is to promote engineering skills (related to sensing and control in particular) among students and young professionals, by means of an attractive robotics topic in an exciting competition format. The first edition of this contest was organized in Cluj-Napoca, Romania on October 19th 2019. Teams from industry and academia competed, with an overall positive experience. We outline the challenge and scoring rules, together with the technical solutions of the teams, and close with a summary of the results and points to improve for the next editions.
Online at ScienceDirect.
«
K. Mirkamali, L. Busoniu, Cross Entropy Optimization of Action Modification Policies for Continuous-Valued MDPs. In Proceedings 21st IFAC World Congress (IFAC-20), Online (Berlin, Germany), 12–17 July 2020. »
Abstract: We propose an algorithm to search for parametrized policies in continuous state and action Markov Decision Processes (MDPs). The policies are represented via a number of basis functions, and the main novelty is that each basis function corresponds to a small, discrete modification of the continuous action. In each state, the policy chooses a discrete action modification associated with a basis function having the maximum value at the current state. Empirical returns from a representative set of initial states are estimated in simulations to evaluate the policies. Instead of using slow gradient-based algorithms, we apply the cross-entropy method for updating the parameters. The proposed algorithm is applied to a double integrator and an inverted pendulum problem, with encouraging results.
Online at ScienceDirect.
«
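The cross-entropy update mentioned in the abstract above can be sketched generically: sample parameter vectors from a Gaussian, keep the best-scoring ones, and refit the Gaussian to them. In this illustration the score is a hypothetical stand-in function, whereas the paper scores a policy by its empirical return from a set of initial states; the sample sizes and names are assumptions.

```python
import random
import statistics

def cross_entropy_search(score, mean, std, n_samples=50, n_elite=10,
                         n_iters=40, seed=0):
    """Maximize score(params) with a diagonal-Gaussian cross-entropy method."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        # Sample candidate parameter vectors from the current Gaussian.
        samples = [
            [rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(n_samples)
        ]
        # Keep the highest-scoring candidates (the elite set).
        elites = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit mean and standard deviation to the elites, per dimension.
        columns = list(zip(*elites))
        mean = [statistics.fmean(c) for c in columns]
        std = [max(statistics.pstdev(c), 1e-6) for c in columns]
    return mean

# Stand-in score with its maximum at 3.0 (not a policy return).
best = cross_entropy_search(lambda p: -(p[0] - 3.0) ** 2, mean=[0.0], std=[2.0])
```

No gradient of the score is needed, which is why the method suits returns estimated from simulation.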
I. Lal, A. Codrean, L. Busoniu, Sliding mode control of a ball balancing robot. In Proceedings 21st IFAC World Congress (IFAC-20), Online (Berlin, Germany), 12–17 July 2020. »
Abstract: This paper presents a sliding mode control design for a ball-balancing robot (ballbot), with associated real-time results. The sliding mode control is designed based on the linearized plant model, and is robust to matched uncertainties. The design is considerably simpler than other nonlinear control strategies presented in the literature, and the experimental results for stabilization and tracking show much better performance than that obtained with linear control (in particular, a linear quadratic regulator).
Online at ScienceDirect.
«
Z. Nagy, Zs. Lendek, L. Busoniu, Control and Estimation for Mobile Sensor-Target Problems with Distance-Dependent Noise. In Proceedings IEEE American Control Conference (ACC-20), Online (Denver, Colorado), 1–3 July 2020.
M. Granzotto, R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz, Optimistic planning for the near-optimal control of nonlinear switched discrete-time systems with stability guarantees. In Proceedings 58th IEEE Conference on Decision and Control (CDC-19), Nice, France, 11–13 December 2019. »
Abstract: Originating in the artificial intelligence literature, optimistic planning (OP) is an algorithm that generates near-optimal control inputs for generic nonlinear discrete-time systems whose input set is finite. This technique is therefore relevant for the near-optimal control of nonlinear switched systems, for which the switching signal is the control. However, OP exhibits several limitations, which prevent its application in a standard control context. First, it requires the stage cost to take values in [0, 1], an unnatural prerequisite as it excludes, for instance, quadratic stage costs. Second, it requires the cost function to be discounted. Third, it applies for reward maximization, and not cost minimization. In this paper, we modify OP to overcome these limitations, and we call the new algorithm OPmin. We then make stabilizability and detectability assumptions, under which we derive near-optimality guarantees for OPmin and we show that the obtained bound has major advantages compared to the bound originally given by OP. In addition, we prove that a system whose inputs are generated by OPmin in a receding-horizon fashion exhibits stability properties. As a result, OPmin provides a new tool for the near-optimal, stable control of nonlinear switched discrete-time systems for generic cost functions.
Online at IEEEXplore.
«
R. Postoyan, M. Granzotto, L. Busoniu, B. Scherrer, D. Nesic, J. Daafouz, Stability guarantees for nonlinear discrete-time systems controlled by approximate value iteration. In Proceedings 58th IEEE Conference on Decision and Control (CDC-19), Nice, France, 11–13 December 2019. »
Abstract: Value iteration is a method to generate optimal control inputs for generic nonlinear systems and cost functions. Its implementation typically leads to approximation errors, which may have a major impact on the closed-loop system performance. In this case we speak of approximate value iteration (AVI). In this paper, we investigate the stability of systems for which the inputs are obtained by AVI. We consider deterministic discrete-time nonlinear plants and a class of general, possibly discounted, costs. We model the closed-loop system as a family of systems parameterized by tunable parameters, which are used for the approximation of the value function at different iterations, the discount factor and the iteration step at which we stop running the algorithm. It is shown, under natural stabilizability and detectability properties as well as mild conditions on the approximation errors, that the family of closed-loop systems exhibits local practical stability properties. The analysis is based on the construction of a Lyapunov function given by the sum of the approximate value function and the Lyapunov-like function that characterizes the detectability of the system. By strengthening our conditions, asymptotic and exponential stability properties are guaranteed.
Online at IEEEXplore.
«
L. Busoniu, J. Daafouz, C. Morarescu, Near-optimal control of nonlinear systems with simultaneous controlled and random switches. In 5th IFAC Conference on Intelligent Control and Automation Sciences (ICONS-19), pages 268–273, Belfast, Northern Ireland, 21–23 August 2019. »
Abstract: We consider dual switched systems, in which two switching signals act simultaneously to select the dynamical mode. The first signal is controlled and the second is random, with probabilities that evolve either periodically or as a function of the dwell time. We formalize both cases as Markov decision processes, which allows them to be solved with a simple approximate dynamic programming algorithm. We illustrate the framework in a problem where the random signal is a delay on the control channel that is used to send the controlled signal to the system.
Online at ScienceDirect.
«
G. Feng, T.M. Guerra, A.T. Nguyen, L. Busoniu, S. Mohammad, Robust Observer-Based Tracking Control Design for Power-Assisted Wheelchairs. In 5th IFAC Conference on Intelligent Control and Automation Sciences (ICONS-19), pages 61–66, Belfast, Northern Ireland, 21–23 August 2019. »
Abstract: Power-assisted wheelchairs (PAW) are efficient means of transportation for disabled persons. The resulting human-machine system includes several unknown parameters such as the mass of the user or ground adhesion. Moreover, the torque signals produced by the human are required to design a robust assistive strategy, but measuring them with torque sensors increases significantly the cost of the system. Therefore, we propose a robust observer-based assistive controller using a polytopic representation. The closed-loop control design is composed of two elements: a state feedback controller with the full state at the input, and an unknown input observer to estimate human torques and feed them into the obtained controller. The goal is to guarantee an imposed H∞ estimation performance while achieving reference tracking. To achieve the predefined performance, the observer gains are computed by solving an LMI problem. Finally, simulation results validate the control design. The methodology follows patent WO2015173094 issued in 2015 (Mohammad et al. 2015).
Online at ScienceDirect.
«
A.-D. Mezei, L. Tamas, L. Busoniu, Sorting objects from a conveyor belt using active perception with a POMDP model. In 18th IEEE European Control Conference (ECC-19), pages 2466–2471, Napoli, Italy, 25–28 June 2019. »
Abstract: We consider an application where a robot must sort objects traveling on a conveyor belt into different classes. The detector and classifier work on 3D point clouds, but are of course not fully accurate, so they sometimes misclassify objects. We describe this task using a novel model in the formalism of partially observable Markov decision processes. With the objective of finding the correct classes with a small number of observations, we then apply a state-of-the-art POMDP solver to plan a sequence of observations from different viewpoints, as well as the moments when the robot decides the class of the current object (which automatically triggers sorting and moving the conveyor belt). In a first version, observations are carried out only for the object at the end of the conveyor belt, after which we extend the framework to observe multiple objects. The performance with both versions is analyzed in simulations, in which we study the ratio of correct to incorrect classifications and the total number of steps to sort a batch of objects. Real-life experiments with a Baxter robot are then provided with publicly shared code and data at http://community.clujit.ro/display/TEAM/Active+perception.
Online at IEEEXplore.
«
I. Lal, M. Nicoara, A. Codrean, L. Busoniu, Hardware and Control Design of a Ball Balancing Robot. In 22nd IEEE International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS-19), Cluj-Napoca, Romania, 24–26 April 2019. »
Abstract: This paper presents the construction of a new ball balancing robot (ballbot), together with the design of a controller to balance it vertically around a given position in the plane. Requirements on physical size and agility lead to the choice of ball, motors, gears, omnidirectional wheels, and body frame. The electronic hardware architecture is presented in detail, together with timing results showing that real-time control can be achieved. Finally, we design a linear quadratic regulator for balancing, starting from a 2D model of the robot. Experimental balancing results are satisfactory, maintaining the robot in a disc 0.3 m in diameter.
Online at IEEEXplore.
«
M. Granzotto, R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz, Stability analysis of discrete-time finite-horizon discounted optimal control. In 57th IEEE Conference on Decision and Control (CDC-18), pages 2322–2327, Miami, USA, 17–19 December 2018. »
Abstract: Discounted costs are considered in many fields, like reinforcement learning, for which various algorithms can be used to obtain optimal inputs for finite horizons. The related literature mostly concentrates on optimality and largely ignores stability. In this context, we study stability of general nonlinear discrete-time systems controlled by an optimal sequence of inputs that minimizes a finite-horizon discounted cost computed in a receding horizon fashion. Assumptions are made related to the stabilizability of the system and its detectability with respect to the stage cost. Then, a Lyapunov function for the closed-loop system with the receding horizon controller is constructed and a uniform semiglobal stability property is ensured, where the adjustable parameters are both the discount factor and the horizon length. Uniform global exponential stability is guaranteed by strengthening the initial assumptions, in which case explicit bounds on the discount factor and the horizon length are provided. We compare the obtained bounds in the particular cases where there is no discount or the horizon is infinite, respectively, with related results in the literature and we show our bounds improve existing ones on the examples considered.
Online at IEEEXplore.
«
C. Morarescu, V. Varma, L. Busoniu, S. Lasaulce, Space-time budget allocation for marketing over social networks. In Proceedings IFAC Conference on Analysis and Design of Hybrid Systems (ADHS-18), pages 211–216, Oxford, UK, 11–13 July 2018. »
Abstract: We address formally the problem of opinion dynamics when the agents of a social network (e.g., consumers) are not only influenced by their neighbors but also by an external influential entity referred to as a marketer. The influential entity tries to sway the overall opinion to its own side by using a specific influence budget during discrete-time advertising campaigns; consequently, the overall closed-loop dynamics becomes a linear-impulsive (hybrid) one. The main technical issue addressed is finding how the marketer should allocate its budget over time (through marketing campaigns) and over space (among the agents) such that the agents' opinion be as close as possible to a desired opinion; for instance, the marketer may prioritize certain agents over others based on their influence in the social graph. The corresponding space-time allocation problem is formulated and solved for several special cases of practical interest. Valuable insights can be extracted from our analysis. For instance, for most cases we prove that the marketer has an interest in investing most of its budget at the beginning of the process and that budget should be shared among agents according to the famous water-filling allocation rule. Numerical examples illustrate the analysis.
Online at ScienceDirect.
«
L. Busoniu, V. S. Varma, I.-C. Morarescu, S. Lasaulce, Learning-based control for a communicating mobile robot under unknown rates. In IEEE American Control Conference (ACC-19), pages 267–272, Philadelphia, USA, 10–12 July 2019. »
Abstract: In problems such as surveying or monitoring remote regions, a mobile robot must transmit data over a wireless network with unknown, position-dependent transmission rates. We propose an algorithm to achieve this objective that learns approximations of the rate function and of an optimal-control solution that transmits the data in minimum time. The rates are estimated with supervised learning from the samples observed, and the control is found with dynamic programming sweeps around the current state of the robot that exploit the rate function estimate, combined with online reinforcement learning. For both synthetic and realistic rate functions, our experiments show that the learning algorithm empties the data buffer in less than twice the number of steps achieved by a model-based solution that requires perfect knowledge of the rate function.
Online at IEEEXplore.
«
G. Feng, L. Busoniu, T.M. Guerra, S. Mohammad, Reinforcement Learning for Energy Optimization Under Human Fatigue Constraints of Power-Assisted Wheelchairs. In IEEE American Control Conference (ACC-18), Milwaukee, USA, 27–29 June 2018. »
Abstract: In the last decade, Power-Assisted Wheelchairs (PAWs) have been widely used for improving the mobility of disabled persons. The main advantage of PAWs is that users can keep a suitable physical activity. Moreover, the metabolic-electrical energy hybridization of PAWs provides more flexibility for optimal control design. In this context, we propose an optimal control for minimizing the electrical energy consumption under human fatigue constraints, including a human fatigue model. The electrical motor has to cooperate with the user over a given distance-to-go. As the human fatigue model is unknown in practice, we use model-free Policy Gradient methods to directly learn controllers for a given driving task. We verify that the model-free solution is near-optimal by computing the model-based controller, which is generated by Approximate Dynamic Programming. Simulation results confirm that the model-free Policy Gradient method provides near-optimal solutions.
Online at IEEEXplore.
«
G. Feng, T.M. Guerra, S. Mohammad, L. Busoniu, Observer-Based Assistive Control Design Under Time-Varying Sampling for Power-Assisted Wheelchairs. In IFAC Conference on Embedded Systems, Computational Intelligence and Telematics in Control (CESCIT-18), Faro, Portugal, 6–8 June 2018. »
Abstract: Compared to manual wheelchairs and fully electric powered wheelchairs, power-assisted wheelchairs (PAWs) provide a special structure where the human can use her/his propulsion to interact with the assistive system. In this context, different studies have focused on the assistive control of PAWs in recent years. This paper presents an observer-based assistive control design using only position encoders. With a time-varying sampling induced by these position encoders, the wheelchair is described by a discrete-time Linear Parameter Varying model. Based on a Takagi-Sugeno (TS) representation, an observer is designed by using LMI techniques. According to the estimated human torques, we use the frequencies with which the wheels are pushed to compute the reference velocity of the centre of gravity. The wheelchair turns with a constant yaw velocity when one of two wheels is braked by the human. Reference tracking is accomplished by a PI controller. Simulation results confirm that the proposed assistive control algorithm provides good maneuverability for users to control the velocity of the centre of gravity and the yaw velocity of the wheelchair.
Online at ScienceDirect.
«
C. Iuga, P. Dragan, L. Busoniu, Fall monitoring and detection for at-risk persons using a UAV. In IFAC Conference on Embedded Systems, Computational Intelligence and Telematics in Control (CESCIT-18), Faro, Portugal, 6–8 June 2018. »
Abstract: We describe a demonstrator application that uses a UAV to monitor and detect falls of an at-risk person. The position and state (upright or fallen) of the person are determined with deep-learning-based computer vision, where existing network weights are used for position detection, while for fall detection the last layer is fine-tuned in additional training. A simple visual servoing control strategy keeps the person in view of the drone, and maintains the drone at a set distance from the person. In experiments, falls were reliably detected, and the algorithm was able to successfully track the person indoors.
Online at ScienceDirect.
«
G. Feng, T.M. Guerra, L. Busoniu, S. Mohammad, Unknown input observer in descriptor form via LMIs for power-assisted wheelchairs. In 36th Chinese Control Conference (CCC-17), Dalian, China, 26–28 July 2017. »
Abstract: Power-assisted wheelchairs (PAW) provide an efficient means of transport for disabled persons. In this human-machine interaction, the human-applied torque is a crucial variable to implement the assistive system. The present paper describes a novel scheme to design PAWs without torque sensors. Instead of using a torque sensor, a discrete-time unknown input observer in descriptor form is applied to estimate the human input torque and the angular velocities of the two wheels via the angular position. Using Finsler's lemma, the observer gains are obtained by solving an LMI problem. Based on the estimation, both a torque-assistance system and a speed controller are introduced. In addition, the Input-to-State Stability (ISS) of the interconnected controller-observer system is analysed for the speed controller. Finally, simulation results validate the observer and the power-assisted algorithms. The methodology follows patent WO2015173094 issued in 2015.
Online at IEEEXplore.
«
J. Xu, L. Busoniu, B. De Schutter, Near-Optimal Control with Adaptive Receding Horizon for Discrete-Time Piecewise Affine Systems. In Proceedings 20th IFAC World Congress (IFAC-17), pages 4168–4173, Toulouse, France, 9–14 July 2017. »
Abstract: We consider the infinite-horizon optimal control of discrete-time, Lipschitz continuous piecewise affine systems with a single input. Stage costs are discounted, bounded, and use a 1-norm or infinity-norm. Rather than using the usual fixed-horizon approach from model-predictive control, we tailor an adaptive-horizon method called optimistic planning for continuous actions (OPC) to solve the piecewise affine control problem in receding horizon. The main advantage is the ability to solve problems requiring arbitrarily long horizons. Furthermore, we introduce a novel extension that provides guarantees on the closed-loop performance, by reusing data (learning) across different steps. This extension is general and works for a large class of nonlinear dynamics. In experiments with piecewise affine systems, OPC improves performance compared to a fixed-horizon approach, while the data-reuse approach yields further improvements.
Online at ScienceDirect.
«
S. Sabau, I.-C. Morarescu, L. Busoniu, A. Jadbabaie, Decoupled-Dynamics Distributed Control for Strings of Nonlinear Autonomous Agents. In IEEE American Control Conference (ACC-17), Seattle, USA, 24–26 May 2017. »
Abstract: We introduce a novel distributed control architecture for a class of nonlinear dynamical agents moving in the 'string' formation, while guaranteeing trajectory tracking and collision avoidance. Each autonomous agent uses information and relative measurements only with respect to its predecessor in the string. The performance of the scheme is entirely scalable with respect to the number of agents in formation. The scalability is a consequence of the “decoupling” of a certain bounded approximation of the closed-loop equations, entailing that individual, local analyses of the closed-loop stability at each agent will in turn guarantee the aggregated stability of the entire formation. An efficient, practical method for compensating communication-induced delays is also presented.
Online at IEEEXplore.
«
J. Ben Rejeb, L. Busoniu, I.-C. Morarescu, J. Daafouz, Near-Optimal Control of Nonlinear Switched Systems with Non-Cooperative Switching Rules. In IEEE American Control Conference (ACC-17), Seattle, USA, 24–26 May 2017. »
Abstract: This paper presents a predictive, planning algorithm for nonlinear switched systems where there are two switching signals, one controlled and the other uncontrolled, both subject to constraints on the dwell time after a switch. The algorithm solves a minimax problem where the controlled signal is chosen to optimize a discounted sum of rewards, while taking into account the worst possible uncontrolled switches. It is an extension of a classical minimax search method, so we call it optimistic minimax search with dwell time constraints, OMSdelta. For any combination of dwell times, OMSdelta returns a sequence of switches that is provably near-optimal, and can be applied in receding horizon for closed loop control. For the case when the two dwell times are the same, we provide a convergence rate to the minimax optimum as a function of the computation invested, modulated by a measure of problem complexity. We show how the framework can be used to model switched systems with time delays on the control channel, and provide an illustrative simulation for such a system with nonlinear modes.
Online at IEEEXplore.
«
E. Pall, L. Tamas, L. Busoniu, Analysis and a Home Assistance Application of Online AEMS2 Planning. In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS-16), Daejeon, Korea, 9–14 October 2016. »
Abstract: We consider an online planning algorithm for partially observable Markov decision processes (POMDPs), called Anytime Error Minimization Search 2 (AEMS2). Despite the considerable success it has enjoyed in robotics and other problems, no quantitative analysis exists of the relationship between its near-optimality and the computation invested. Exploiting ideas from fully-observable MDP planning, we provide here such an analysis, in which the relationship is modulated via a measure of problem complexity called near-optimality exponent. We illustrate the exponent for some interesting POMDP structures, and examine the role of the informative heuristics used by AEMS2 in the guarantees. In the second part of the paper, we introduce a domestic assistance problem in which a robot monitors partially observable switches and turns them off if needed. AEMS2 successfully solves this task in real experiments, and also works better than several state of the art planners in simulation comparisons.
Online at IEEEXplore.
«
J. Xu, T. van den Boom, L. Busoniu, B. De Schutter, Model Predictive Control for Continuous Piecewise Affine Systems Using Optimistic Optimization. In Proceedings IEEE American Control Conference (ACC-16), Boston, USA, 6–8 July 2016. »
Abstract: This paper considers model predictive control for continuous piecewise affine (PWA) systems. In general, this leads to a nonlinear, nonconvex optimization problem. We introduce an approach based on optimistic optimization to solve the resulting optimization problem. Optimistic optimization is based on recursive partitioning of the feasible set and is characterized by an efficient exploration strategy seeking the optimal solution. The advantage of optimistic optimization is that one can guarantee bounds on the suboptimality with respect to the global optimum for a given computational budget. The 1-norm and infinity-norm objective functions often considered in model predictive control for continuous PWA systems are continuous PWA functions. We derive expressions for the core parameters required by optimistic optimization for the resulting optimization problem. By applying optimistic optimization, a sequence of control inputs is designed satisfying linear constraints. A bound on the suboptimality of the returned solution is also discussed. The performance of the proposed approach is illustrated with a case study on adaptive cruise control. «
L. Busoniu, E. Pall, R. Munos, Discounted Near-Optimal Control of General Continuous-Action Nonlinear Systems Using Optimistic Planning. In Proceedings IEEE American Control Conference (ACC-16), Boston, USA, 6–8 July 2016. »
Abstract: We propose an optimistic planning method to search for near-optimal sequences of actions in discrete-time, infinite-horizon optimal control problems with discounted rewards. The dynamics are general nonlinear, while the action (input) is scalar and compact. The method works by iteratively splitting the infinite-dimensional search space into hyperboxes. Under appropriate conditions on the dynamics and rewards, we analyze the shrinking rate of the range of possible values in each box. When coupled with a measure of problem complexity, this leads to an overall convergence rate of the algorithm to the infinite-horizon optimum, as a function of computation invested. We provide simulation results showing that the algorithm is useful in practice, and comparing it with two alternative planning methods. «
K. Mathe, L. Busoniu, L. Barabas, L. Miclea, J. Braband, C. Iuga, Vision-Based Control of a Quadrotor for an Object Inspection Scenario. In Proceedings 2016 International Conference on Unmanned Aircraft Systems (ICUAS-16), pages 849–857, Arlington, USA, 7–10 June 2016. »
Abstract: Unmanned aerial vehicles (UAVs) have gained special attention in recent years, among others in monitoring and inspection applications. In this paper, a less explored application field is proposed, railway inspection, where UAVs can be used to perform visual inspection tasks such as semaphore, catenary, or track inspection. We focus on lightweight UAVs, which can detect many events in railways (for example missing indicators or cabling, or obstacles on the tracks). An outdoor scenario is developed where a quadrotor visually detects a railway semaphore and flies around it autonomously, recording a video of it for offline post-processing. For these tasks, we exploit object detection methods from literature, and develop a visual servoing technique. Additionally, we perform a thorough comparison of several object detection approaches before selecting a preferred method. Then, we show the performance of the presented filtering solutions when they are used in servoing, and conclude our experiments with evaluating real outdoor flight trajectories using an AR.Drone 2.0 quadrotor.
Online at IEEEXplore.
«
J. Xu, L. Busoniu, T. van den Boom, B. De Schutter, Receding-Horizon Control for Max-Plus Linear Systems with Discrete Actions Using Optimistic Planning. In Proceedings 13th International Workshop on Discrete Event Systems (WODES-16), pages 398–403, Xi'an, China, 30 May – 1 June 2016. »
Abstract: This paper addresses the infinite-horizon optimal control problem for max-plus linear systems where the considered objective function is a sum of discounted stage costs over an infinite horizon. The minimization problem of the cost function is equivalently transformed into a maximization problem of a reward function. The resulting optimal control problem is solved based on an optimistic planning algorithm. The control variables are the increments of system inputs and the action space is discretized as a finite set. Given a finite computational budget, a control sequence is returned by the optimistic planning algorithm. The first control action or a subsequence of the returned control sequence is applied to the system and then a receding-horizon scheme is adopted. The proposed optimistic planning approach allows us to limit the computational budget and also yields a characterization of the level of near-optimality of the resulting solution. The effectiveness of the approach is illustrated with a numerical example. The results show that the optimistic planning approach results in a lower tracking error compared with a finite-horizon approach when a subsequence of the returned control sequence is applied.
Online at IEEEXplore.
«
L. Busoniu, M.-C. Bragagnolo, J. Daafouz, C. Morarescu, Planning Methods for the Optimal Control and Performance Certification of General Nonlinear Switched Systems. In Proceedings 54th IEEE Conference on Decision and Control (CDC-15), Osaka, Japan, 15–18 December 2015. »
Abstract: We consider two problems for discrete-time switched systems with autonomous, general nonlinear modes. The first is optimal control of the switches so as to minimize the discounted infinite-horizon sum of the costs. The second problem occurs when switches are a disturbance, and the worst-case cost under any sequence of switches is sought. We use an optimistic planning (OP) algorithm that can solve general optimal control with discrete inputs such as switches. We extend the analysis of OP to provide sequences of switches with certification (upper and lower) bounds on the optimal and worst-case costs, and to characterize the convergence rate of the gap between these bounds. Since a minimum dwell time between switches must often be ensured, we introduce a new optimistic planning variant that can handle this case, and analyze its convergence rate. Simulations for linear and nonlinear modes illustrate that the approach works in practice.
Online at IEEEXplore.
«
T. Wensveen, L. Busoniu, R. Babuska, Real-Time Optimistic Planning with Action Sequences. In Proceedings 20th International Conference on Control Systems and Computer Science (CSCS-15), pages 923–930, Bucharest, Romania, 27–29 May 2015. »
Abstract: Optimistic planning (OP) is a promising approach for receding-horizon optimal control of general nonlinear systems. This generality comes however at large computational costs, which so far have prevented the application of OP to the control of nonlinear physical systems in real-time. We therefore introduce an extension of OP to real-time control, which applies open-loop sequences of actions in parallel with finding the next sequence from the predicted state at the end of the current sequence. Exploiting OP guarantees, we provide conditions under which the algorithm is provably feasible in real-time, and we analyze its performance. We report successful real-time experiments for the swingup of an inverted pendulum, as well as simulation results for an acrobot, where the impact of model errors is studied.
Online at IEEEXplore.
«
R. Postoyan, L. Busoniu, D. Nesic, J. Daafouz, Stability of Infinite-Horizon Optimal Control with Discounted Cost. In Proceedings 53rd IEEE Conference on Decision and Control (CDC-14), pages 3903–3908, Los Angeles, USA, 15–17 December 2014. »
Abstract: We investigate the stability of general nonlinear discrete-time systems controlled by an optimal sequence of inputs that minimizes an infinite-horizon discounted cost. We first provide conditions under which a global asymptotic stability property is ensured for the corresponding undiscounted problem. We then show that this property is semiglobally and practically preserved in the discounted case, where the adjustable parameter is the discount factor. We then focus on a scenario where the stage cost is bounded and we explain how our framework applies to guarantee stability in this case. Finally, we provide sufficient conditions, including boundedness of the stage cost, under which the value function, which serves as a Lyapunov function for the analysis, is continuous. As already shown in the literature, the continuity of the Lyapunov function is crucial to ensure some nominal robustness for the closed-loop system.
Online at IEEEXplore.
«
K. Mathe, L. Busoniu, R. Munos, B. De Schutter, Optimistic Planning with a Limited Number of Action Switches for Near-Optimal Nonlinear Control. In Proceedings 53rd IEEE Conference on Decision and Control (CDC-14), pages 3518–3523, Los Angeles, USA, 15–17 December 2014. »
Abstract: We consider infinite-horizon optimal control of nonlinear systems where the actions (inputs) are discrete. With the goal of limiting computations, we introduce a search algorithm for action sequences constrained to switch at most a given number of times between different actions. The new algorithm belongs to the optimistic planning class originating in artificial intelligence, and is called optimistic switch-limited planning (OSP). It inherits the generality of the OP class, so it works for nonlinear, nonsmooth systems with nonquadratic costs. We develop analysis showing that the switch constraint leads to polynomial complexity in the search horizon, in contrast to the exponential complexity of state-of-the-art OP; and to a correspondingly faster convergence. The degree of the polynomial varies with the problem and is a meaningful measure for the difficulty of solving it. We study this degree in two representative, opposite cases. In simulations we first apply OSP to a problem where limited-switch sequences are near-optimal, and then in a networked control setting where the switch constraint must be satisfied in closed loop.
Online at IEEEXplore.
«
L. Busoniu, R. Munos, E. Pall, An Analysis of Optimistic, Best-First Search for Minimax Sequential Decision Making. In Proceedings IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-14), pages 1–8, Orlando, USA, 10–13 December 2014. »
Abstract: We consider problems in which a maximizer and a minimizer agent take actions in turn, such as games or optimal control with uncertainty modeled as an opponent. We extend the ideas of optimistic optimization to this setting, obtaining a search algorithm that has been previously considered as the best-first search variant of the B* method. We provide a novel analysis of the algorithm relying on a certain structure for the values of action sequences, under which earlier actions are more important than later ones. An asymptotic branching factor is defined as a measure of problem complexity, and it is used to characterize the relationship between computation invested and near-optimality. In particular, when action importance decreases exponentially, convergence rates are obtained. Throughout, examples illustrate analytical concepts such as the branching factor. In an empirical study, we compare the optimistic best-first algorithm with two classical game tree search methods, and apply it to a challenging HIV infection control problem.
Online at IEEEXplore.
«
L. Busoniu, L. Tamas, Optimistic Planning for the Near-Optimal Control of General Nonlinear Systems with Continuous Transition Distributions. In Proceedings 19th IFAC World Congress (IFAC-14), pages 1910–1915, Cape Town, South Africa, 24–29 August 2014. »
Abstract: Optimistic planning is an optimal control approach from artificial intelligence, which can be applied in receding horizon. It works for very general nonlinear dynamics and cost functions, and its analysis establishes a tight relationship between computation invested and near-optimality. However, there is no optimistic planning algorithm that searches for closed-loop solutions in stochastic problems with continuous transition distributions. Such transitions are essential in control, where they arise e.g. due to continuous disturbances. Existing algorithms only search for open-loop input sequences, which are suboptimal. We therefore propose a closed-loop algorithm that discretizes the continuous transition distribution into sigma points, and call it sigma-optimistic planning. Assuming the error introduced by sigma-point discretization is bounded, we analyze the solution returned, showing that it is near-optimal. The algorithm is evaluated in simulation experiments, where it performs better than a state-of-the-art open-loop planning technique; a certainty-equivalence approach also works well.
Online at ScienceDirect.
«
E. Pall, K. Mathe, L. Tamas, L. Busoniu, Railway Track Following with the AR.Drone Using Vanishing Point Detection. In Proceedings 2014 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR-14), Cluj-Napoca, Romania, 22–24 May 2014. »
Abstract: Unmanned aerial vehicles are increasingly being used and showing their advantages in many domains. However, their application to railway systems has been little studied. In this paper, we focus on controlling an AR.Drone UAV in order to follow the railway track. The method developed relies on vision-based detection and tracking of the vanishing point of the railway tracks, overhead lines, and other related lines in the image, coupled with a controller that adjusts the yaw so as to keep the vanishing point in the center of the image. Simulation results illustrate that the method is effective, and are complemented by vanishing-point tracking results on real images. «
K. Mathe, L. Busoniu, L. Miclea, Optimistic Planning with Long Sequences of Identical Actions for Near-Optimal Nonlinear Control. In Proceedings 2014 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR-14), Cluj-Napoca, Romania, 22–24 May 2014. »
Abstract: Optimistic planning for deterministic systems (OPD) is an algorithm able to find near-optimal control for very general, nonlinear systems. OPD iteratively builds near-optimal sequences of actions by always refining the most promising sequence; this is done by adding all possible one-step actions. However, OPD has large computational costs, which might be undesirable in real life applications. This paper proposes an adaptation of OPD for a specific subclass of control problems where control actions do not change often (e.g. bang-bang, time-optimal control). The new algorithm is called Optimistic Planning with K identical actions (OKP), and it refines sequences by adding, in addition to one-step actions, also repetitions of each action up to K times. Our analysis proves that the a posteriori performance guarantees are similar to those of OPD, improving with the length of the explored sequences, though the asymptotic behaviour of OKP cannot be formally predicted a priori. Simulations illustrate that for properly chosen parameter K, in a control problem from the class considered, OKP outperforms OPD. «
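To make the refinement step described in the abstract above concrete, here is a minimal sketch of optimistic planning for deterministic systems as it is commonly stated, assuming rewards in [0, 1] and a discount factor gamma in (0, 1). It is an illustration under those assumptions, not the authors' implementation; the names `opd`, `f`, `rho` and `budget` are hypothetical. Each leaf sequence keeps the discounted reward sum nu collected so far and an upper bound b = nu + gamma^d / (1 - gamma); the leaf with the largest b is refined by appending every one-step action.

```python
import heapq
from itertools import count


def opd(x0, f, rho, actions, gamma, budget):
    """Sketch of optimistic planning for deterministic systems.

    f(x, u) returns the next state; rho(x, u) returns a reward in [0, 1].
    budget is the number of node expansions (must be at least 1).
    Returns (first action of the best sequence found, its value nu).
    """
    tie = count()  # tie-breaker so the heap never compares states
    # Heap entries: (-b_value, tie, depth, nu, state, first_action).
    heap = [(-1.0 / (1.0 - gamma), next(tie), 0, 0.0, x0, None)]
    best_nu, best_first = -1.0, None
    for _ in range(budget):
        # Pop the most promising leaf (largest upper bound b).
        _, _, depth, nu, x, first = heapq.heappop(heap)
        for u in actions:
            nu_child = nu + gamma ** depth * rho(x, u)
            b_child = nu_child + gamma ** (depth + 1) / (1.0 - gamma)
            first_child = u if first is None else first
            heapq.heappush(
                heap,
                (-b_child, next(tie), depth + 1, nu_child, f(x, u), first_child),
            )
            if nu_child > best_nu:
                best_nu, best_first = nu_child, first_child
    return best_first, best_nu
```

In receding-horizon use, only the returned first action would be applied before replanning from the next state.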
L. Busoniu, C. Morarescu, Consensus for Agents with General Dynamics Using Optimistic Optimization. In Proceedings 2013 Conference on Decision and Control (CDC-13), Florence, Italy, 10–13 December 2013. »
Abstract: An important challenge in multiagent systems is consensus, in which the agents must agree on certain controlled variables of interest. So far, most consensus algorithms for agents with nonlinear dynamics exploit the specific form of the nonlinearity. Here, we propose an approach that only requires a black-box simulation model of the dynamics, and is therefore applicable to a wide class of nonlinearities. This approach works for agents communicating on a fixed, connected network. It designs a reference behavior with a classical consensus protocol, and then finds control actions that drive the nonlinear agents towards the reference states, using a recent optimistic optimization algorithm. By exploiting the guarantees of optimistic optimization, we prove that the agents achieve practical consensus. A representative example is further analyzed, and simulation results on nonlinear robotic arms are provided.
This is an extended version with a more detailed proof of the main result.
«
L. Busoniu, R. Postoyan, J. Daafouz, Near-Optimal Strategies for Nonlinear Networked Control Systems Using Optimistic Planning. In Proceedings 2013 American Control Conference (ACC-13), Washington DC, US, 17–19 June 2013. »
Abstract: We consider the scenario where a controller communicates with a general nonlinear plant via a network, and must optimize a performance index. The problem is modeled in discrete time and the admissible control inputs are constrained to belong to a finite set. Exploiting a recent optimistic planning algorithm from the artificial intelligence field, we propose two control strategies that take into account communication constraints induced by the use of the network. Both resulting algorithms have guaranteed near-optimality. In the first strategy, input sequences are transmitted to the plant at a fixed period, and we show bounded computation. In the second strategy, the algorithm decides the next transmission instant according to the last state measurement (leading to a self-triggered policy), working within a fixed computation budget. For this case, we guarantee long transmission intervals. Examples and simulation experiments are provided throughout the paper to illustrate the results. «
L. Busoniu, C. Morarescu, Optimistic Planning for Consensus. In Proceedings 2013 American Control Conference (ACC-13), Washington DC, US, 17–19 June 2013. »
Abstract: An important challenge in multiagent systems is consensus, in which the agents are required to synchronize certain controlled variables of interest, often using only an incomplete and time-varying communication graph. We propose a consensus approach based on optimistic planning (OP), a predictive control algorithm that finds near-optimal control actions for any nonlinear dynamics and reward (cost) function. At every step, each agent uses OP to solve a local control problem with rewards that express the consensus objectives. Neighboring agents coordinate by exchanging their predicted behaviors in a predefined order. Due to its generality, OP consensus can adapt to any agent dynamics and, by changing the reward function, to a variety of consensus objectives. OP consensus is demonstrated for velocity consensus (flocking) with a time-varying communication graph, where it preserves connectivity better than a classical algorithm; and for leaderless and leader-based consensus of robotic arms, where OP easily deals with the nonlinear dynamics. «
R. Fonteneau, L. Busoniu, R. Munos, Optimistic Planning for Belief-Augmented Markov Decision Processes. In Proceedings 2013 Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-13), Singapore, 15–19 April 2013. »
Abstract: This paper presents the Bayesian Optimistic Planning (BOP) algorithm, a novel model-based Bayesian reinforcement learning approach. BOP extends the planning approach of the Optimistic Planning for Markov Decision Processes (OP-MDP) algorithm to contexts where the transition model of the MDP is initially unknown and progressively learned through interactions within the environment. The knowledge about the unknown MDP is represented with a probability distribution over all possible transition models using Dirichlet distributions, and the BOP algorithm plans in the belief-augmented state space constructed by concatenating the original state vector with the current posterior distribution over transition models. We show that BOP becomes Bayesian optimal when the budget parameter increases to infinity. Preliminary empirical validations show promising performance. «
L. Busoniu, A. Daniels, R. Munos, R. Babuska, Optimistic Planning for Continuous-Action Deterministic Systems. In Proceedings 2013 Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-13), Singapore, 15–19 April 2013. »
Abstract: We consider the class of online planning algorithms for optimal control, which compared to dynamic programming are relatively unaffected by large state dimensionality. We introduce a novel planning algorithm called SOOP that works for deterministic systems with continuous states and actions. SOOP is the first method to explore the true solution space, consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. SOOP can be used parameter-free at the cost of more model calls, but we also propose a more practical variant tuned by a parameter alpha, which balances finer discretization with longer planning horizons. Experiments on three problems show SOOP reliably ranks among the best algorithms, fully dominating competing methods when the problem requires both long horizons and fine discretization. «
I. Grondman, L. Busoniu, R. Babuska, Model-Learning Actor-Critic Algorithms: Performance Evaluation in a Motion Control Task. In Proceedings 51st IEEE Conference on Decision and Control (CDC-12), pages 5272–5277, Maui, Hawaii, 10–13 December 2012. »
Abstract: Reinforcement learning (RL) control provides a means to deal with uncertainty and nonlinearity associated with control tasks in an optimal way. The class of actor–critic RL algorithms proved useful for control systems with continuous state and input variables. In the literature, model-based actor–critic algorithms have recently been introduced to considerably speed up the learning by constructing on-line a model through local linear regression (LLR). However, it has not been analyzed whether the speed-up is due to the model-learning structure or the LLR approximator. Therefore, in this paper we generalize the model-learning actor–critic algorithms to make them suitable for use with an arbitrary function approximator. Furthermore, we present the results of an extensive analysis through numerical simulations of a typical nonlinear motion control problem. The LLR approximator is compared with radial basis functions (RBFs) in terms of the initial convergence rate and in terms of the final performance obtained. The results show that LLR-based actor–critic RL outperforms the RBF counterpart: it gives quick initial learning and comparable or even superior final control performance.
Online at IEEEXplore.
«
L. Busoniu, R. Munos, Optimistic Planning for Markov Decision Processes. In Proceedings 15th International Conference on Artificial Intelligence and Statistics (AISTATS-12), pages 182–189, La Palma, Canary Islands, Spain, 21–23 April 2012. »
Abstract: The reinforcement learning community has recently intensified its interest in online planning methods, due to their relative independence of the state space size. However, tight near-optimality guarantees are not yet available for the general case of stochastic Markov decision processes and closed-loop, state-dependent planning policies. We therefore consider an algorithm related to AO* that optimistically explores a tree representation of the space of closed-loop policies, and we analyze the near-optimality of the action it returns after n tree node expansions. While this optimistic planning requires a finite number of actions and possible next states for each transition, its asymptotic performance does not depend directly on these numbers, but only on the subset of nodes that significantly impact near-optimal policies. We characterize this set by introducing a novel measure of problem complexity, called the near-optimality exponent. Specializing the exponent and performance bound for some interesting classes of MDPs illustrates that the algorithm works better when there are fewer near-optimal policies and less uniform transition probabilities.
The PDF includes supplementary material to the paper, containing proofs of the analytical results.
Online at JMLR Proceedings.
«
M. Vaandrager, R. Babuska, L. Busoniu, G. Lopes, Imitation Learning with Non-Parametric Regression. In Proceedings 2012 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR-12), pages 91–96, Cluj-Napoca, Romania, 24–27 May 2012. »
Abstract: Humans are very fast learners. Yet, we rarely learn a task completely from scratch. Instead, we usually start with a rough approximation of the desired behavior and take the learning from there. In this paper, we use imitation to quickly generate a rough solution to a robotic task from demonstrations, supplied as a collection of state-space trajectories. Appropriate control actions needed to steer the system along the trajectories are then automatically learned in the form of a (nonlinear) state-feedback control law. The learning scheme has two components: a dynamic reference model and an adaptive inverse process model, both based on a data-driven, non-parametric method called local linear regression. The reference model infers the desired behavior from the demonstration trajectories, while the inverse process model provides the control actions to achieve this behavior and is improved online using learning. Experimental results with a pendulum swing-up problem and a robotic arm demonstrate the practical usefulness of this approach. The resulting learned dynamics are not limited to single trajectories, but capture instead the overall dynamics of the motion, making the proposed approach a promising step towards versatile learning machines such as future household robots, or robots for autonomous missions.
Online at IEEEXplore.
«
I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, E. Schuitema, Actor-Critic Control with Reference Model Learning. In Proceedings 18th IFAC World Congress (IFAC-11), pages 14723–14728, Milano, Italy, 22 August–2 September 2011. »
Abstract: We propose a new actor-critic algorithm for reinforcement learning. The algorithm does not use an explicit actor, but learns a reference model which represents a desired behaviour, along which the process is to be controlled by using the inverse of a learned process model. The algorithm uses Local Linear Regression (LLR) to learn approximations of all the functions involved. The online learning of a process and reference model, in combination with LLR, provides an efficient policy update for faster learning. In addition, the algorithm facilitates the incorporation of prior knowledge. The novel method and a standard actor-critic algorithm are applied to the pendulum swingup problem, in which the novel method achieves faster learning than the standard algorithm.
Online at ScienceDirect.
«
S. Norrouzadeh, L. Busoniu, R. Babuska, Efficient Knowledge Transfer in Shaping Reinforcement Learning. In Proceedings 18th IFAC World Congress (IFAC-11), Milano, Italy, 22 August–2 September 2011. »
Abstract: Reinforcement learning is an attractive solution for deriving an optimal control policy by on-line exploration of the control task. Shaping aims to accelerate reinforcement learning by starting from easy tasks and gradually increasing the complexity, until the original task is solved. In this paper, we consider the essential decision on when to transfer learning from an easier task to a more difficult one, so that the total learning time is reduced. We propose two transfer criteria for making this decision, based on the agent's performance. The first criterion measures the agent's performance by the distance between its current solution and the optimal one, and the second by the empirical return obtained. We investigate the learning time gains achieved by using these criteria in a classical gridworld navigation benchmark. This numerical study also serves to compare several major shaping techniques.
Online at ScienceDirect.
«
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Approximate Reinforcement Learning: An Overview. In Proceedings 2011 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-11), pages 1–8, Paris, France, 11–15 April 2011. »
Abstract: Reinforcement learning (RL) allows agents to learn how to optimally interact with complex environments. Fueled by recent advances in approximation-based algorithms, RL has obtained impressive successes in robotics, artificial intelligence, control, operations research, etc. However, the scarcity of survey papers about approximate RL makes it difficult for newcomers to grasp this intricate field. With the present overview, we take a step toward alleviating this situation. We review methods for approximate RL, starting from their dynamic programming roots and organizing them into three major classes: approximate value iteration, policy iteration, and policy search. Each class is subdivided into representative categories, highlighting among others offline and online algorithms, policy gradient methods, and simulation-based techniques. We also compare the different categories of methods, and outline possible ways to enhance the reviewed algorithms.
Online at IEEEXplore.
«
L. Busoniu, R. Munos, B. De Schutter, R. Babuska, Optimistic Planning for Sparsely Stochastic Systems. In Proceedings 2011 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-11), pages 48–55, Paris, France, 11–15 April 2011. Part of the Special Session on Active Reinforcement Learning. »
Abstract: We propose an online planning algorithm for finite-action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about action quality, and this allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug effectiveness.
Online at IEEEXplore.
«
E. Schuitema, L. Busoniu, R. Babuska, P. Jonker, Control Delay in Reinforcement Learning for Real-Time Dynamic Systems: A Memoryless Approach. In Proceedings 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS-10), pages 3226–3231, Taipei, Taiwan, 18–22 October 2010. »
Abstract: Robots controlled by Reinforcement Learning (RL) are still rare. A core challenge to the application of RL to robotic systems is to learn despite the existence of control delay – the delay between measuring a system's state and acting upon it. Control delay is always present in real systems. In this work, we present two novel temporal difference (TD) learning algorithms for problems with control delay. These algorithms improve learning performance by taking the control delay into account. We test our algorithms in a gridworld, where the delay is an integer multiple of the time step, as well as in the simulation of a robotic system, where the delay can have any value. In both tests, our proposed algorithms outperform classical TD learning algorithms, while maintaining low computational complexity.
Online at IEEEXplore.
«
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Online Least-Squares Policy Iteration for Reinforcement Learning Control. In Proceedings 2010 American Control Conference (ACC-10), pages 486–491, Baltimore, United States, 30 June – 2 July 2010. »
Abstract: Reinforcement learning is a promising paradigm for learning optimal control. We consider policy iteration (PI) algorithms for reinforcement learning, which iteratively evaluate and improve control policies. State-of-the-art, least-squares techniques for policy evaluation are sample-efficient and have relaxed convergence requirements. However, they are typically used in offline PI, whereas a central goal of reinforcement learning is to develop online algorithms. Therefore, we propose an online PI algorithm that evaluates policies with the so-called least-squares temporal difference for Q-functions (LSTD-Q). The crucial difference between this online least-squares policy iteration (LSPI) algorithm and its offline counterpart is that, in the online case, policy improvements must be performed once every few state transitions, using only an incomplete evaluation of the current policy. In an extensive experimental evaluation, online LSPI is found to work well for a wide range of its parameters, and to learn successfully in a real-time example. Online LSPI also compares favorably with offline LSPI and with a different flavor of online PI, which instead of LSTD-Q employs another least-squares method for policy evaluation.
Online at IEEEXplore.
«
L. Busoniu, B. De Schutter, R. Babuska, D. Ernst, Using Prior Knowledge to Accelerate Online Least-Squares Policy Iteration. In Proceedings 2010 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR-10), Cluj-Napoca, Romania, 28–30 May 2010. »
Abstract: Reinforcement learning (RL) is a promising paradigm for learning optimal control. Although RL is generally envisioned as working without any prior knowledge about the system, such knowledge is often available and can be exploited to great advantage. In this paper, we consider prior knowledge about the monotonicity of the control policy with respect to the system states, and we introduce an approach that exploits this type of prior knowledge to accelerate a state-of-the-art RL algorithm called online least-squares policy iteration (LSPI). Monotonic policies are appropriate for important classes of systems appearing in control applications. LSPI is a data-efficient RL algorithm that we previously extended to online learning, but that did not provide until now a way to use prior knowledge about the policy. In an empirical evaluation, online LSPI with prior knowledge learns much faster and more reliably than the original online LSPI.
Online at IEEEXplore.
«
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Policy Search with Cross-Entropy Optimization of Basis Functions. In Proceedings 2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09), pages 153–160, Nashville, United States, 30 March – 2 April 2009. »
Abstract: This paper introduces a novel algorithm for approximate policy search in continuous-state, discrete-action Markov decision processes (MDPs). Previous policy search approaches have typically used ad-hoc parameterizations developed for specific MDPs. In contrast, the novel algorithm employs a flexible policy parameterization, suitable for solving general discrete-action MDPs. The algorithm looks for the best closed-loop policy that can be represented using a given number of basis functions, where a discrete action is assigned to each basis function. The locations and shapes of the basis functions are optimized, together with the action assignments. This allows a large class of policies to be represented. The optimization is carried out with the cross-entropy method and evaluates the policies by their empirical return from a representative set of initial states. We report simulation experiments in which the algorithm reliably obtains good policies with only a small number of basis functions, albeit at sizable computational costs.
The SMC-B 2010 journal article above is a heavily extended and revised version of this paper.
Online at IEEEXplore.
«
L. Busoniu, D. Ernst, B. De Schutter, R. Babuska, Fuzzy Partition Optimization for Approximate Fuzzy Q-iteration. In Proceedings 17th IFAC World Congress (IFAC-08), pages 5629–5634, Seoul, Korea, 6–11 July 2008. »
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. Because exact RL can only be applied to very simple problems, approximate algorithms are usually necessary in practice. Many algorithms for approximate RL rely on basis-function representations of the value function (or of the Q-function). Designing a good set of basis functions without any prior knowledge of the value function (or of the Q-function) can be a difficult task. In this paper, we propose instead a technique to optimize the shape of a constant number of basis functions for the approximate, fuzzy Q-iteration algorithm. In contrast to other approaches to adapt basis functions for RL, our optimization criterion measures the actual performance of the computed policies in the task, using simulation from a representative set of initial states. A complete algorithm, using cross-entropy optimization of triangular fuzzy membership functions, is given and applied to the car-on-the-hill example.
Online at ScienceDirect.
«
X. Yuan, L. Busoniu, R. Babuska, Reinforcement Learning for Elevator Control. In Proceedings 17th IFAC World Congress (IFAC-08), pages 2212–2217, Seoul, Korea, 6–11 July 2008. »
Abstract: Reinforcement learning (RL) comprises an array of techniques that learn a control policy so as to maximize a reward signal. When applied to the control of elevator systems, RL has the potential of finding better control policies than classical heuristic, suboptimal policies. On the other hand, elevator systems offer an interesting benchmark application for the study of RL. In this paper, RL is applied to a single-elevator system. The mathematical model of the elevator system is described in detail, making the system easy to re-implement and re-use. An experimental comparison is made between the performance of the Q-value iteration and Q-learning RL algorithms, when applied to the elevator system.
Online at ScienceDirect.
«
L. Busoniu, D. Ernst, R. Babuska, B. De Schutter, Consistency of Fuzzy Model-Based Reinforcement Learning. In Proceedings 2008 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-08), pages 518–524, Hong Kong, 1–6 June 2008. »
Abstract: Reinforcement learning (RL) is a widely used paradigm for learning control. Computing exact RL solutions is generally only possible when process states and control actions take values in a small discrete set. In practice, approximate algorithms are necessary. In this paper, we propose an approximate, model-based Q-iteration algorithm that relies on a fuzzy partition of the state space, and a discretization of the action space. Using assumptions on the continuity of the dynamics and of the reward function, we show that the resulting algorithm is consistent, i.e., that the optimal solution is obtained asymptotically as the approximation accuracy increases. An experimental study indicates that a continuous reward function is also important for a predictable improvement in performance as the approximation accuracy increases.
Online at IEEEXplore.
«
L. Busoniu, D. Ernst, R. Babuska, B. De Schutter, Fuzzy Approximation for Convergent Model-Based Reinforcement Learning. In 2007 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE-07), pages 968–973, London, United Kingdom, 23–26 July 2007. »
Abstract: Reinforcement learning (RL) is a learning control paradigm that provides well-understood algorithms with good convergence and consistency properties. Unfortunately, these algorithms require that process states and control actions take only discrete values. Approximate solutions using fuzzy representations have been proposed in the literature for the case when the states and possibly the actions are continuous. However, the link between these mainly heuristic solutions and the larger body of work on approximate RL, including convergence results, has not been made explicit. In this paper, we propose a fuzzy approximation structure for the Q-value iteration algorithm, and show that the resulting algorithm is convergent. The proof is based on an extension of previous results in approximate RL. We then propose a modified, serial version of the algorithm that is guaranteed to converge at least as fast as the original algorithm. An illustrative simulation example is also provided.
Online at IEEEXplore.
«
L. Busoniu, D. Ernst, R. Babuska, B. De Schutter, Continuous-State Reinforcement Learning with Fuzzy Approximation. In Adaptive Learning Agents and Multi-Agent Systems (ALAMAS-07) Symposium, pages 21–35, Maastricht, The Netherlands, 2–3 April 2007. »
Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. Well-understood RL algorithms with good convergence and consistency properties exist. In their original form, these algorithms require that the environment states and agent actions take values in a relatively small discrete set. Fuzzy representations for approximate, model-free RL have been proposed in the literature for the more difficult case where the state-action space is continuous. In this work, we propose a fuzzy approximation structure similar to those previously used for Q-learning, but we combine it with the model-based Q-value iteration algorithm. We show that the resulting algorithm converges. We also give a modified, serial variant of the algorithm that converges at least as fast as the original version. An illustrative simulation example is provided.
The (downloadable) LNAI 2008 paper above is an extended and revised version of this paper.
«
L. Busoniu, B. De Schutter, R. Babuska, Decentralized Reinforcement Learning Control of a Robotic Manipulator. In Proceedings 9th IEEE International Conference on Control, Automation, Robotics and Vision (ICARCV-06), pages 1347–1352, Singapore, 5–8 December 2006. »
Abstract: Multi-agent systems are rapidly finding applications in a variety of domains, including robotics, distributed control, telecommunications, etc. Learning approaches to multi-agent control, many of them based on reinforcement learning (RL), are investigated in complex domains such as teams of mobile robots. However, the application of decentralized RL to low-level control tasks is not as intensively studied. In this paper, we investigate centralized and decentralized RL, emphasizing the challenges and potential advantages of the latter. These are then illustrated on an example: learning to control a two-link rigid manipulator. In closing, some open issues and future research directions in decentralized RL are outlined.
Keywords: multi-agent learning, decentralized control, reinforcement learning.
Online at IEEEXplore.
«
L. Busoniu, R. Babuska, B. De Schutter, Multi-Agent Reinforcement Learning: A Survey. In Proceedings 9th IEEE International Conference on Control, Automation, Robotics and Vision (ICARCV-06), pages 527–532, Singapore, 5–8 December 2006. »
Abstract: Multi-agent systems are rapidly finding applications in a variety of domains, including robotics, distributed control, telecommunications, etc. Many tasks arising in these domains require that the agents learn behaviors online. A significant part of the research on multi-agent learning concerns reinforcement learning techniques. However, due to different viewpoints on central issues, such as the formal statement of the learning goal, a large number of different methods and approaches have been introduced. In this paper we aim to present an integrated survey of the field. First, the issue of the multi-agent learning goal is discussed, after which a representative selection of algorithms is reviewed. Open issues are identified and future research directions are outlined.
Keywords: multi-agent systems, reinforcement learning, game theory, distributed control.
The (downloadable) SMC-C 2008 journal article above is an extended and revised version of this paper.
Online at IEEEXplore.
«
L. Busoniu, B. De Schutter, R. Babuska, Multiagent Reinforcement Learning with Adaptive State Focus. In Proceedings 17th Belgian-Dutch Conference on Artificial Intelligence (BNAIC-05), pages 35–42, Brussels, Belgium, 17–18 October 2005. »
Abstract: In realistic multi-agent systems, learning on the basis of complete state information is not feasible. We introduce adaptive state focus Q-learning, a class of methods derived from Q-learning that start learning with only the state information that is strictly necessary for a single agent to perform the task, and that monitor the convergence of learning. If lack of convergence is detected, the learner dynamically expands its state space to incorporate more state information (e.g., states of other agents). Learning is faster and takes fewer resources than if the complete state were considered from the start, while being able to handle situations where agents interfere in pursuing their goals. We illustrate our approach by instantiating a simple version of such a method, and by showing that it outperforms learning with full state information without being hindered by the deficiencies of learning on the basis of a single agent's state.
Keywords: multi-agent learning, adaptive learning, Q-learning, coordination.
Online at PubZone.
«

Theses

L. Busoniu, Optimistic Planning for Nonlinear Optimal Control and Networked Systems, habilitation thesis, 2015, 155 pages. »
Abstract: This thesis deals with the optimal control of nonlinear systems using artificial intelligence techniques. In particular, we exploit the class of optimistic planning methods from AI, and along a first, fundamental research line we extend them to handle several novel aspects, including stochasticity in the dynamics, disturbance modeled conservatively as an opponent, continuous input actions, etc. The second main line of the work applies optimistic planning to unsolved problems in control, including the networked near-optimal control of general nonlinear systems, and the cooperative control of multiagent systems. All techniques come with validation in simulation or real-life experiments, and most of them are theoretically analyzed to guarantee near-optimality and other properties, as a function of the computational effort invested. Applications in robotics are additionally explored. The thesis concludes with an outline of future plans, aiming to achieve the overall long-term goal of a learning and planning framework for the control of complex systems. «
L. Busoniu, Reinforcement Learning in Continuous State and Action Spaces, PhD thesis, 2008, 190 pages, ISBN 978-90-9023754-1. »
Abstract:
Reinforcement learning (RL) and dynamic programming (DP) algorithms can be used to solve problems in a variety of fields, including automatic control, artificial intelligence, operations research, and economics. These algorithms find an optimal policy, which maximizes a numerical reward signal measuring the performance. DP algorithms require a model of the problem's dynamics, whereas RL algorithms work without a model. Online RL algorithms do not even require data in advance; they learn from experience. However, DP and RL can find exact solutions only when the states and the control actions take values in a small discrete set. In large discrete spaces and in continuous spaces, approximate solutions have to be used. This is the case, e.g., in automatic control, where the states and actions are usually continuous.
This thesis proposes several novel algorithms for approximate RL and DP, which work in problems with continuous variables: fuzzy Q-iteration, online least-squares policy iteration, and cross-entropy policy search. Fuzzy Q-iteration is a DP algorithm that represents the value function (cumulative rewards) using a fuzzy partition of the state space and a discretization of the action space. The value function is used to compute a near-optimal policy. Fuzzy Q-iteration is provably convergent and consistent. Online least-squares policy iteration is an RL algorithm that efficiently learns from experience an approximate value function and a corresponding policy. It updates the value function parameters by solving linear systems of equations. Cross-entropy policy search represents policies using a highly flexible parameterization, and optimizes the parameters with the cross-entropy method. A representative selection of control problems is used to assess the performance of the proposed algorithms. Additionally, the thesis provides an extensive review of the state-of-the-art in approximate DP and RL, and discusses some fundamental open issues in the field.
To obtain a bound hardcopy of the thesis free of charge, please contact me (preferably by email) mentioning your name and address, together with your interest in the thesis.
«

National journal papers

K. Mathe, L. Busoniu, L. Miclea, Optimistic Planning with Long Sequences of Identical Actions: An Extended Theoretical and Experimental Study. Acta Electrotehnica, vol. 56, no. 4, pages 27–34, 2015. »
Abstract: Optimistic planning for deterministic systems (OPD) finds near-optimal control solutions for general, non-linear systems. OPD iteratively explores a search tree of action sequences by always expanding further the most promising sequence, where each expansion appends all possible one-step actions. However, the generality of the algorithm comes at a high computational cost. We aim to alleviate this complexity in a subclass of control problems where longer ranges of constant actions are preferred, by adapting OPD to this class of problems. The novel algorithm is called optimistic planning with K identical actions (OKP), and it creates sequences by appending to them up to K repetitions of each possible action. In our analysis we show that indeed, OKP offers a similar a posteriori performance to OPD, and in certain cases the tree depth reached (a measure of the performance) is increased compared to OPD. Our experiments, performed on the inverted pendulum and HIV infection treatment control, confirm that for suitable control problems OKP can perform better than OPD, for properly tuned parameter K. «
L. Busoniu, B. De Schutter, R. Babuska, D. Ernst, Exploiting Policy Knowledge in Online Least-Squares Policy Iteration: An Empirical Study. Automation, Computers, Applied Mathematics, vol. 19, no. 4, pages 521–529, 2010. »
Abstract: Reinforcement learning (RL) is a promising paradigm for learning optimal control. Traditional RL works for discrete variables only, so to deal with the continuous variables appearing in control problems, approximate representations of the solution are necessary. The field of approximate RL has tremendously expanded over the last decade, and a wide array of effective algorithms is now available. However, RL is generally envisioned as working without any prior knowledge about the system or the solution, whereas such knowledge is often available and can be exploited to great advantage. Therefore, in this paper we describe a method that exploits prior knowledge to accelerate online least-squares policy iteration (LSPI), a state-of-the-art algorithm for approximate RL. We focus on prior knowledge about the monotonicity of the control policy with respect to the system states. Such monotonic policies are appropriate for important classes of systems appearing in control applications, including for instance nearly linear systems and linear systems with monotonic input nonlinearities. In an empirical evaluation, online LSPI with prior knowledge is shown to learn much faster and more reliably than the original online LSPI. «

Abstracts, workshop presentations

E. Gorski, V. Varma, L. Busoniu, C. Morarescu, Control design for robust platoons with communication delays. Presented at the European Nonlinear Dynamics Conference, Delft, The Netherlands, 22–26 July 2024. »
Abstract: This work investigates platooning systems employing Adaptive Cruise Control (ACC) and Cooperative Adaptive Cruise Control (CACC) strategies, with the main focus being on controller parameter design for string stability under communication delays. ACC relies on onboard sensors, while CACC augments this capability with vehicle-to-vehicle (V2V) communication. Both systems are used to synchronize vehicle speeds and maintain precise inter-vehicular spacing in platoons. Our goal is to find conditions on the controller’s proportional and derivative gains to balance individual vehicle stability and string stability. The latter is crucial to ensure that spacing errors do not amplify as they propagate along the platoon, a fundamental requirement for safe and efficient platooning. We provide a method for characterizing controller gains for string and individual stability based on the well-known D-decomposition in the time-delay systems literature. Numerical simulations validate the theoretical results. «
L. Busoniu, A. Daniels, R. Munos, R. Babuska, Optimistic Planning for Continuous-Action Deterministic Systems. Presented at the 8emes Journees Francophones sur la Planification, la Decision et l'Apprentissage pour la conduite de systemes (JFPDA-13), Lille, France, 1–2 July 2013. »
Abstract: We consider the optimal control of systems with deterministic dynamics, continuous, possibly large-scale state spaces, and continuous, low-dimensional action spaces. We describe an online planning algorithm called SOOP, which like other algorithms in its class has no direct dependence on the state space structure. Unlike previous algorithms, SOOP explores the true solution space, consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. To this end, it borrows the principle of the simultaneous optimistic optimization method, and develops a nontrivial adaptation of this principle to the planning problem. Experiments on four problems show SOOP reliably ranks among the best algorithms, fully dominating competing methods when the problem requires both long horizons and fine discretization.
This is an extended version of the ADPRL-13 paper with the same title.
«
L. Busoniu, R. Munos, B. De Schutter, R. Babuska, Optimistic Planning for Sparsely Stochastic Systems. Presented at the 2011 Workshop on Monte-Carlo Tree Search: Theory and Applications, within the 21st International Conference on Automated Planning and Scheduling (ICAPS-11), Freiburg, Germany, 12 June 2011. »
Abstract: We describe an online planning algorithm for finite-action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where the most promising states are expanded first, in an optimistic procedure aiming to return a good action after a strictly limited number of expansions. The novel algorithm is called optimistic planning for sparsely stochastic systems. «

Disclaimer: The following applies to the papers that are directly available for download as PDF files. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each copyright holder. In most cases, these works may not be reposted without the explicit permission of the copyright holder. Additionally, the following applies to IEEE material: Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.