Mathematical modelling of control-loss detection via risk-sensitive reinforcement learning on partially observable Markov decision processes
DOI: https://doi.org/10.64700/altay.17

Keywords: Reinforcement learning, partially observable Markov decision process, risk-sensitive metric, explainable AI techniques

Abstract
This work introduces a method for identifying high-risk loss-of-control episodes in digital settings by combining risk-sensitive reinforcement learning with decision-making under partial observability. We motivate the need to reason with incomplete and noisy information, which is typical of real-world deployments and, in particular, of monitoring user behaviour during critical states. The agent-environment interaction is modelled within the partially observable Markov decision process (POMDP) formalism, which maintains a belief, a probabilistic posterior over latent states, given histories of actions and observations. Behaviour is analysed at the trajectory level, and tail risk is quantified via the Conditional Value at Risk (CVaR), which assesses expected losses in worst-case regimes rather than average-case performance. To ensure transparency and foster trust, we integrate explainable AI (XAI) techniques that reveal the factors driving risk estimates and action choices. The resulting pipeline provides a principled basis for adaptive detection of critical states and for early-warning interventions in complex digital environments, supporting reliable and accountable decision support.
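For readers unfamiliar with the formalism, the belief mentioned in the abstract is the standard Bayes-filter posterior over latent states (see Kaelbling et al. [11]). In the usual textbook notation, with transition model $T(s' \mid s, a)$ and observation model $O(o \mid s', a)$ (a notational convention assumed here, not taken from the paper itself), the belief after taking action $a$ and observing $o$ is

$$
b'(s') \;=\; \frac{O(o \mid s', a)\,\sum_{s \in S} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)},
\qquad
\Pr(o \mid b, a) \;=\; \sum_{s' \in S} O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s).
$$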
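The tail-risk measure referenced in the abstract is likewise standard: for a trajectory loss $L$ (e.g., the negated return) and level $\alpha \in (0, 1)$, with $\mathrm{VaR}_\alpha(L)$ the $\alpha$-quantile of $L$,

$$
\mathrm{CVaR}_\alpha(L) \;=\; \mathbb{E}\big[\, L \mid L \ge \mathrm{VaR}_\alpha(L) \,\big]
\;=\; \min_{c \in \mathbb{R}} \Big\{\, c + \tfrac{1}{1-\alpha}\, \mathbb{E}\big[(L - c)^{+}\big] \,\Big\},
$$

where the second (Rockafellar–Uryasev) form agrees with the conditional expectation for continuous loss distributions and is the one typically optimised in CVaR reinforcement learning [14, 23].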
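To make the trajectory-level quantity concrete, the sketch below estimates CVaR empirically from sampled per-trajectory losses. It is a minimal illustration under assumptions of our own (the function name `empirical_cvar`, the synthetic lognormal losses, and the usage scenario are all hypothetical), not the authors' implementation:

```python
import numpy as np

def empirical_cvar(losses: np.ndarray, alpha: float = 0.95) -> float:
    """Mean of the worst (1 - alpha) tail of per-trajectory losses.

    losses: one scalar loss per sampled trajectory (e.g., negated return).
    """
    losses = np.sort(np.asarray(losses, dtype=float))   # ascending order
    var_idx = int(np.ceil(alpha * losses.size)) - 1     # empirical alpha-quantile index
    var = losses[var_idx]                               # empirical Value at Risk
    return float(losses[losses >= var].mean())          # average over the worst tail

# Hypothetical usage: flag a behaviour stream when its tail loss is extreme.
rng = np.random.default_rng(seed=0)
trajectory_losses = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
print(f"CVaR_0.95 of sampled trajectory losses: "
      f"{empirical_cvar(trajectory_losses, alpha=0.95):.3f}")
```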
References
[1] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila and F. Herrera: Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Information Fusion, 58 (2020), 82–115.
[2] R. J. Boucherie, N. M. van Dijk: Markov decision processes in practice, Springer International Publishing, Cham, Switzerland (2017).
[3] A. R. Cassandra: A survey of POMDP applications, Working notes of the AAAI 1998 Fall Symposium on Planning with Partially Observable Markov Decision Processes, (1998), 17–24.
[4] X. Chen, Y. Mu, P. Luo, S. E. Li and J. Chen: Flow-based recurrent belief state learning for POMDPs, Proceedings of the 39th International Conference on Machine Learning, 162 (2022), 3444–3468.
[5] D. Duffie, J. Pan: An overview of value at risk, Journal of Derivatives, 4 (3) (1997), 7–49.
[6] R. Dwivedi, D. Dave, H. Naik, S. Singhal, R. Omer, P. Patel, B. Qian, Z. Wen, T. Shah, G. Morgan and R. Ranjan: Explainable AI (XAI): Core ideas, techniques, and solutions, ACM Computing Surveys, 55 (9) (2023), 1–33.
[7] R. Figueiredo Prudencio, M. R. O. A. Maximo and E. L. Colombini: A survey on offline reinforcement learning: Taxonomy, review, and open problems, IEEE Transactions on Neural Networks and Learning Systems, 35 (8) (2024), 10237–10257.
[8] Y. Fu, D. Wu and B. Boulet: A closer look at offline RL agents, Advances in Neural Information Processing Systems, 35 (2022), 8591–8604.
[9] F. Garcia, E. Rachelson: Markov decision processes, in: Markov Decision Processes in Artificial Intelligence, Wiley (2013), 1–38.
[10] P. Jorion: Risk²: Measuring the risk in value at risk, Financial Analysts Journal, 52 (6) (1996), 47–56.
[11] L. P. Kaelbling, M. L. Littman and A. R. Cassandra: Planning and acting in partially observable stochastic domains, Artificial Intelligence, 101 (1-2) (1998), 99–134.
[12] T. J. Linsmeier, N. D. Pearson: Value at risk, Financial Analysts Journal, 56 (2) (2000), 47–67.
[13] W. S. Lovejoy: A survey of algorithmic methods for partially observed Markov decision processes, Annals of Operations Research, 28 (1) (1991), 47–65.
[14] X. Ni, L. Lai: Robust risk-sensitive reinforcement learning with conditional value-at-risk, Proceedings of the 2024 IEEE Information Theory Workshop, (2024), 520–525.
[15] M. L. Puterman: Markov decision processes, in: Stochastic Models, Handbooks in Operations Research and Management Science, 2, Elsevier (1990), 331–434.
[16] N. Roy, G. Gordon and S. Thrun: Finding approximate POMDP solutions through belief compression, Journal of Artificial Intelligence Research, 23 (2005), 1–40.
[17] A. S. Sinha, A. Mahajan: Agent-state based policies in POMDPs: Beyond belief-state MDPs, Proceedings of the 63rd IEEE Conference on Decision and Control, (2024), 6722–6735.
[18] R. S. Sutton, A. G. Barto: Reinforcement learning: An introduction, (1st ed.), Cambridge, MA: MIT Press (1998).
[19] C. Szepesvári: Reinforcement learning algorithms for MDPs, Wiley Encyclopedia of Operations Research and Management Science (2011).
[20] M. A. Wiering, M. Van Otterlo: Reinforcement learning, Adaptation, Learning, and Optimization, 12, Springer, Berlin, Germany (2012).
[21] Inclusion of gaming disorder in ICD-11, World Health Organization (2018). Retrieved August 15, 2025, https://www.who.int/news/item/14-09-2018-inclusion-of-gaming-disorder-in-icd-11
[22] Gaming disorder, World Health Organization (2025). Retrieved August 15, 2025, https://www.who.int/standards/classifications/frequently-asked-questions/gaming-disorder
[23] Y. Zhao, W. Zhan, X. Hu, H. F. Leung, F. Farnia, W. Sun and J. D. Lee: Provably efficient CVaR RL in low-rank MDPs, arXiv preprint, (2023), arXiv:2311.11965.
License
Copyright (c) 2025 Oleksandr Chaban, Volodymyr Hladun

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.