Efficient Board Games Algorithm with Integrated Strategy Value Network

doi:10.3969/j.issn.1006-2475.2025.01.014

Abstract

Board games have long been a focus of deep reinforcement learning research because their complex board configurations and rules make optimal solutions time-consuming to find. Existing board game algorithms select actions during self-play from an action probability distribution, which makes exploration and exploitation inefficient. They also compute strategy and value with separate neural networks, resulting in low sample utilization and long training times. This paper proposes an efficient board game algorithm with an integrated strategy-value network, which replaces the original action selection method with the Gumbel-max method and balances exploration and exploitation during action search using the ε-greedy and simulated annealing algorithms. Experimental results show that the proposed algorithm achieves a win rate of over 90% against a range of classical board game algorithms. Moreover, at low Monte Carlo simulation counts, training with the Gumbel-max method yields significantly higher Elo ratings than traditional action selection methods. To reach an Elo rating of 3000, the proposed algorithm requires 50% less training time.
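
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the geometric cooling schedule and its constants, and the way the two pieces are combined in `select_action` are all assumptions made for illustration. The Gumbel-max trick itself is standard: adding independent Gumbel(0, 1) noise to each logit and taking the argmax draws an index distributed as the softmax of the logits.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw an index distributed as softmax(logits) using the Gumbel-max trick."""
    best_index, best_score = -1, -math.inf
    for index, logit in enumerate(logits):
        # Clamp away from 0 so that log(u) is defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        score = logit + gumbel_noise
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def annealed_epsilon(step, eps_start=1.0, eps_min=0.05, cooling=0.99):
    """Assumed geometric cooling schedule, floored at eps_min (hypothetical constants)."""
    return max(eps_min, eps_start * cooling ** step)


def select_action(values, logits, epsilon, rng):
    """Assumed combination: explore by Gumbel-max sampling, otherwise exploit."""
    if rng.random() < epsilon:
        return gumbel_max_sample(logits, rng)
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(values)), key=lambda i: values[i])


rng = random.Random(0)
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]  # hypothetical policy output
values = [0.1, 0.6, 0.3]                                # hypothetical value estimates
for step in (0, 100, 1000):
    eps = annealed_epsilon(step)
    print(step, round(eps, 3), select_action(values, logits, eps, rng))
```

In this sketch the exploration rate starts at 1 and decays toward a floor, so early self-play samples widely from the policy while later moves mostly follow the value estimates. The paper may schedule and combine these components differently.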

Key words: board games, Monte Carlo tree search, Gumbel-max method, ε-greedy algorithm, simulated annealing algorithm

CLC Number: TP183

ZHOU Yi, TIAN Yongshen, QIU Yufeng, GAO Hua. Efficient Board Games Algorithm with Integrated Strategy Value Network [J]. Computer and Modernization, 2025, 0(01): 86-93.

