Reinforcement Learning and Policy Optimization for Stock Trading: A Graph-Based State Space Approach to Sequential Decision Making

Lucas Bennett

Vol. 1 No. 1 (2026), Articles

Published 2026-05-10

Keywords

Reinforcement Learning

Abstract

Stock trading is inherently a sequential decision-making problem: an agent must balance immediate rewards against long-term portfolio growth while managing risk in an uncertain market. Traditional approaches to algorithmic trading combine supervised learning for price prediction with heuristic rules for position sizing and risk management, and therefore do not jointly optimize the complete trading pipeline end to end. This paper proposes RL-S3G (Reinforcement Learning Stock State Space Graph), a framework that integrates reinforcement learning policy optimization with the Stock State Space Graph (S3G) architecture to learn trading policies end to end. Our approach uses the S3G model introduced by Lu, Hu, and Zhang as the environment representation within a Markov Decision Process (MDP), enabling an RL agent to learn trading strategies that directly optimize risk-adjusted returns. We adapt the explanation-based bias decoupling regularization principles from Zang and Liu's work on natural language inference to make the learned policies more robust to market regime changes. Through extensive experiments on benchmark financial datasets, we demonstrate that RL-S3G substantially outperforms both supervised learning baselines and existing reinforcement learning trading agents on Sharpe ratio, maximum drawdown, and other risk-adjusted performance measures. Our work contributes to the growing body of research on deep reinforcement learning in quantitative finance, providing a principled framework for learning adaptive trading policies in complex market environments.
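To make the evaluation criteria named in the abstract concrete, the following is a minimal Python sketch of the two standard metrics (annualized Sharpe ratio and maximum drawdown) together with a toy single-asset MDP transition. It is an illustration under stated assumptions, not the RL-S3G implementation: the abstract does not specify the action space, reward, or transaction costs, so the function `step`, the three-valued action set, and the `cost_rate` parameter are hypothetical, and the S3G state representation is not modeled at all.

```python
import math


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of per-period simple returns.

    Uses the sample standard deviation; returns 0.0 when it is undefined
    (fewer than two observations or zero volatility).
    """
    excess = [r - risk_free for r in returns]
    n = len(excess)
    if n < 2:
        return 0.0
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)
    std = math.sqrt(variance)
    if std == 0.0:
        return 0.0
    return mean / std * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve,
    reported as a positive fraction of the running peak."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


def step(position, action, asset_return, cost_rate=0.001):
    """One toy MDP transition (hypothetical, not taken from the paper).

    action is the target position in {-1, 0, 1} (short, flat, long).
    The reward is the position's return over the period minus a
    proportional cost on the change in position.
    """
    new_position = action
    reward = new_position * asset_return - cost_rate * abs(new_position - position)
    return new_position, reward


if __name__ == "__main__":
    # Roll a fixed "always long" policy over a short, made-up return series.
    asset_returns = [0.01, -0.02, 0.015, 0.005, -0.01]
    position, rewards = 0, []
    for r in asset_returns:
        position, reward = step(position, 1, r)
        rewards.append(reward)
    print("Sharpe:", sharpe_ratio(rewards))
    print("Max drawdown:", max_drawdown(rewards))
```

In a setup like this, an RL agent would replace the fixed "always long" policy, and the per-step reward could be swapped for a risk-adjusted quantity; which reward RL-S3G actually uses is not stated in the abstract.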

References

1. Xu, G., Murthy, S. V., & Jia, B. (2025). Enhancing intuitive decision-making and reliance through human–AI collaboration: A review. Informatics, 12(4), 135. https://doi.org/10.3390/informatics12040135

2. Cetinkaya, N. E., & Krämer, N. (2026). Between transparency and trust: Identifying key factors in AI system perception. Behaviour & Information Technology, 45(5), 840–854. https://doi.org/10.1080/0144929X.2025.2533358

3. Gonzalez, C., & Heidari, H. (2025). A cognitive approach to human–AI complementarity in dynamic decision-making. Nature Reviews Psychology, 4, 808–822. https://doi.org/10.1038/s44159-025-00499-x

4. Wang, J. Z. (2026). The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime. arXiv preprint arXiv:2604.12951. https://doi.org/10.48550/arXiv.2604.12951

5. Bansal, G., Nushi, B., Kamar, E., Lasecki, W. S., Weld, D. S., & Horvitz, E. (2019). Beyond accuracy: The role of mental models in human-AI team performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 7(1), 2–11. https://doi.org/10.1609/hcomp.v7i1.5285

6. Tariq, S., Chhetri, M. B., Nepal, S., & Paris, C. (2025). A²C: A modular multi-stage collaborative decision framework for human–AI teams. Expert Systems with Applications, 282, 127318. https://doi.org/10.1016/j.eswa.2025.127318

7. McInerney, T. (2026). The algorithmic construction of epistemic injustice. In M. L. Flear, C. Davies-Tyrie, & D. Wincott (Eds.), Socio-Legal Studies of Epistemic Injustice and Spaces and Places (pp. 123–154). Palgrave Macmillan. https://doi.org/10.1007/978-3-032-07581-9_5

8. Hemmer, P., Schemmer, M., Kühl, N., Vössing, M., & Satzger, G. (2025). Complementarity in human-AI collaboration: Concept, sources, and evidence. European Journal of Information Systems, 34(6), 979–1002. https://doi.org/10.1080/0960085X.2025.2475962

9. Wang, J. Z. (2026). MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models. arXiv preprint arXiv:2604.19809. https://doi.org/10.48550/arXiv.2604.19809

10. Bei, J., Liu, Z., Huang, J., Wang, X., & Yang, P. (2025, December). Strategic Human Resource Analytics with Explainable Artificial Intelligence: An Interpretable Prediction Framework for Employee Promotion to Support Managerial Decision-Making. In Proceedings of the 2025 6th International Conference on Computer Science and Management Technology (ICCSMT 2025) (pp. 77–82). ACM. https://doi.org/10.1145/3795154.3795166

11. Gonzalez, C. (2026). Toward a science of human–AI teaming for decision making: A complementarity framework. PNAS Nexus, 5(3), pgag030. https://doi.org/10.1093/pnasnexus/pgag030

12. Inkpen, K., Chappidi, S., Mallari, K., Nushi, B., Ramesh, D., Michelucci, P., Mandava, V., Vepřek, L. H., & Quinn, G. (2023). Advancing Human-AI Complementarity: The impact of user expertise and algorithmic tuning on joint decision making. ACM Transactions on Computer-Human Interaction, 30(5), 1–29. https://doi.org/10.1145/3534561

13. Schoeffer, J., De-Arteaga, M., & Kühl, N. (2024). Explanations, fairness, and appropriate reliance in human-AI decision-making. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24) (Article 836, pp. 1–18). ACM. https://doi.org/10.1145/3613904.3642621

14. Fügener, A., Walzner, D. D., & Gupta, A. (2025). Roles of artificial intelligence in collaboration with humans: Automation, augmentation, and the future of work. Management Science, 72(1), 538–557. https://doi.org/10.1287/mnsc.2024.05684

15. Pu, J., Chang, Y., Gao, S., Bao, S., Yan, K., Sun, X., Carvalhais, N., & Myneni, R. B. (2025). MCI GPP: Ensembling a global model- and climate-independent gross primary productivity for 2001–2023. Scientific Data, 12, 1965. https://doi.org/10.1038/s41597-025-06218-8

16. Chang, Y., Winkler, A. J., Noori, A., Knyazikhin, Y., & Myneni, R. B. (2025). Precipitation leads the long-term vegetation increase in the conterminous United States drylands. Environmental Research Letters, 20(4), 044006. https://doi.org/10.1088/1748-9326/adb985

17. Dai, Y., Chen, Z., Pradeepkumar, J., Matsubara, Y., Sun, J., Sakurai, Y., & Dong, Y. (2026). EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild. arXiv preprint arXiv:2605.09505. https://doi.org/10.48550/arXiv.2605.09505

18. Lu, Y., Hu, K., & Zhang, L. (2026). S³G: Stock State Space Graph for Enhanced Stock Trend Prediction. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4081–4085). IEEE. https://doi.org/10.48550/arXiv.2603.24236