Reservoir Operation Optimization by Reinforcement Learning
Planning reservoir management and the optimal operation of surface water resources has always been a critical and strategic concern for governments. Today, substantial equipment, facilities, and budgets are devoted to the optimal scheduling of water and energy resources over long and short horizons. Many researchers work on improving the performance of such systems, usually by applying new mathematical and heuristic techniques to the wide variety of complexities found in real-world applications, especially large-scale problems. Stochasticity, nonlinearity/nonconvexity, and dimensionality are the main sources of complexity. Many techniques circumvent these complexities through some kind of approximation in uncertain environments with complex and unknown relations between system parameters. However, optimizing the operation of large-scale problems with methods that rely on many unrealistic estimates makes the final solution imprecise and usually far from the true optimum. Moreover, hardware and software limitations impose practical constraints that prevent some relations between variables and parameters from being considered. In other words, even if all relations between the parameters of a problem are known and definable, considering all of them simultaneously might make the problem very difficult to solve.
In an optimization model of a real-world reservoir operation, there are usually several objective functions and numerous linear and nonlinear constraints. If the number of variables and parameters makes the model too large, existing software and hardware might not find an optimal solution with conventional optimization methods in a reasonable time. For example, stochastic dynamic programming (SDP), a well-known technique in reservoir management, suffers seriously from the curse of dimensionality and the curse of modeling in multi-reservoir systems. To overcome this challenge, several ideas have been developed and implemented in past decades: dynamic programming successive approximations (DPSA) (Larson, 1968), incremental dynamic programming (IDP) (Hall et al., 1969), multilevel incremental dynamic programming (MIDP) (Nopmongcol & Askew, 1976), and different aggregation decomposition methods (Turgeon, 1981; Ponnambalam, 1987; Ponnambalam & Adams, 1996).
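To give a sense of scale for the curse of dimensionality: if each reservoir's storage is discretized into a fixed number of levels, the number of joint storage states that an SDP table must cover grows exponentially with the number of reservoirs. The sketch below only counts states; the function name and the choice of 20 levels per reservoir are illustrative assumptions, not figures from the study.

```python
def joint_state_count(levels_per_reservoir: int, n_reservoirs: int) -> int:
    """Number of joint storage states when every reservoir uses the same grid."""
    return levels_per_reservoir ** n_reservoirs


if __name__ == "__main__":
    # Hypothetical grid of 20 storage levels per reservoir.
    for n in (1, 2, 5, 10):
        print(n, joint_state_count(20, n))
```

Each of these states would also have to be paired with every inflow class and every candidate release, so the actual computational burden grows faster than the state count alone suggests.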
Using simulation along with optimization techniques could be a promising alternative in water resources management. Labadie (2004) argues that a direct linkage between simulation and the implementation of reservoir optimization algorithms could be an important key to future success in reservoir management.
Reinforcement learning (RL) techniques, as simulation-based optimization techniques, might be suitable approaches to overcome the curse of dimensionality, or at least to reduce its impact, in real-world applications. Learning in these approaches is based on interacting with an environment and receiving immediate or delayed feedback for the actions taken (Watkins & Dayan, 1992; Sutton & Barto, 1998). These techniques can start learning without prior knowledge of the stochastic behavior of the system; for this reason they are called model-free methods. The agent, or decision-maker, begins from an arbitrary situation and interacts with the environment. During these interactions, the agent experiences new situations, stores the results, and uses them in future decision-making. At the beginning of learning, the agent mostly encounters situations it has never observed, so the actions it takes are not based on prior knowledge. After enough interactions with the environment, however, the agent gradually learns the behavior of the system and uses this knowledge to make more accurate decisions. In addition, the agent usually continues to seek new information about the environment by occasionally taking a random action.
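One concrete instance of this learn-from-interaction mechanism is the one-step Q-learning update of Watkins & Dayan (1992). The notation below is the standard textbook form and is not taken from this abstract: $s_t$ and $a_t$ are the state and action at step $t$, $r_{t+1}$ is the feedback received, $\alpha \in (0, 1]$ is a learning rate, and $\gamma \in [0, 1)$ is a discount factor.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

The occasional random action is commonly implemented as an $\varepsilon$-greedy rule: with a small probability $\varepsilon$ the agent picks an action at random, and otherwise it picks the action with the largest current $Q(s_t, a)$. Whether the study uses this particular exploration rule is not stated here.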
RL techniques are able to learn continually, so they can be applied in on-line (real-time) or off-line (simulation) learning. Using RL for on-line learning from scratch, without any prior knowledge, could be very expensive and troublesome. It can therefore be used first for off-line learning, during which a basic understanding of the environment is acquired; this knowledge can then serve as the starting point for on-line learning.
In most real-world applications, the dynamics of the system change continuously. RL techniques are able to adapt to these changes and to respond to them adequately. Furthermore, in some optimization techniques such as SDP, the final policy must cover all possible states of the system, even though many of them are practically impossible or unimportant. In RL, because learning proceeds through simulation or on-line interaction with the environment, the focus is on the significant states, that is, those that are practically reachable.
In this study, Q-Learning, one of the best-known and most popular RL techniques, is used to find an optimal closed-loop operational policy for a single-reservoir problem with linear objective functions, considering the stochastic nature of inflows into the reservoir. This could be a starting point for tackling the difficulty of finding an optimal solution for multi-reservoir applications in the future. As in the SDP method, the release from the reservoir is the decision variable, and it must be determined for every storage level (the water stored in the reservoir), which serves as the system state. Inflow into the reservoir is assumed to be a normally distributed random variable. Two schemes for generating admissible actions, optimistic and pessimistic, are investigated. Based on preliminary simulation results, the performance of the Q-Learning method is measured and compared with the results of the SDP technique.
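The setup described above can be sketched as tabular Q-learning on a toy reservoir. This is a minimal illustration under stated assumptions, not the study's model: the capacity, release range, inflow parameters, reward coefficients, learning parameters, and all function names are hypothetical, and the admissible-action rule (release no more than is currently stored) is a simple placeholder rather than the optimistic or pessimistic scheme, which the abstract does not define.

```python
import numpy as np

S_MAX = 10                           # reservoir capacity in storage units (assumed)
R_MAX = 4                            # largest release per period (assumed)
INFLOW_MEAN, INFLOW_STD = 2.0, 1.0   # normal inflow parameters (assumed)
BENEFIT, SPILL_COST = 1.0, 0.5       # linear reward coefficients (assumed)


def admissible_actions(storage):
    """Placeholder rule: release no more than is currently stored."""
    return np.arange(min(storage, R_MAX) + 1)


def step(storage, release, rng):
    """Simulate one period and return (next_storage, reward)."""
    # Normally distributed inflow, rounded to the storage grid and kept non-negative.
    inflow = max(0, int(round(rng.normal(INFLOW_MEAN, INFLOW_STD))))
    raw = storage - release + inflow          # mass balance before spilling
    spill = max(0, raw - S_MAX)               # water above capacity is lost
    reward = BENEFIT * release - SPILL_COST * spill
    return raw - spill, reward


def q_learning(n_steps=50_000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn Q[storage, release] by epsilon-greedy interaction with the simulator."""
    rng = np.random.default_rng(seed)
    q = np.zeros((S_MAX + 1, R_MAX + 1))
    storage = S_MAX // 2
    for _ in range(n_steps):
        actions = admissible_actions(storage)
        if rng.random() < epsilon:            # explore: random admissible release
            release = int(rng.choice(actions))
        else:                                 # exploit: best release seen so far
            release = int(actions[np.argmax(q[storage, actions])])
        next_storage, reward = step(storage, release, rng)
        best_next = np.max(q[next_storage, admissible_actions(next_storage)])
        q[storage, release] += alpha * (reward + gamma * best_next - q[storage, release])
        storage = next_storage
    return q


def greedy_policy(q):
    """Closed-loop policy: one release for every storage level."""
    policy = []
    for s in range(S_MAX + 1):
        actions = admissible_actions(s)
        policy.append(int(actions[np.argmax(q[s, actions])]))
    return policy


if __name__ == "__main__":
    print(greedy_policy(q_learning()))
```

The learned table has one row per storage level and one column per release, which mirrors the closed-loop policy structure that SDP would also produce. Unlike SDP, the inflow distribution is only sampled through `step`, never used explicitly in the update.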