TLDR
My MCTS agent implementation runs without errors locally, but an IndexError raised by random.choice() shows up when the agent is evaluated outside my machine, and I cannot reproduce it in local games.
Hi, I'm currently stuck on this project, which I need to finish within two weeks, before the program I enrolled in ends. My task, which I have completed, is to implement an agent that plays against a heuristic-driven minimax agent in a game of Isolation between two chess knights. The full implementation details of the game can be found here. In my project the game is played on a 9 x 11 board with a bitboard encoding. My MCTS implementation is quite simple and closely follows the pseudocode provided in this paper (page 6).
Essentially, the general MCTS approach consists of these 4 parts, each implemented by one of the following nested functions in the CustomPlayer class:
Selection - tree_policy
Expansion - expand
Simulation - default_policy
Backpropagation - backup_negamax, update_scores
import math
import random
import time
import logging
from copy import deepcopy
from collections import namedtuple
from sample_players import DataPlayer
class CustomPlayer(DataPlayer):
""" Implement your own agent to play knight's Isolation
The get_action() method is the only required method for this project.
You can modify the interface for get_action by adding named parameters
with default values, but the function MUST remain compatible with the
default interface.
**********************************************************************
NOTES:
- The test cases will NOT be run on a machine with GPU access, nor be
suitable for using any other machine learning techniques.
- You can pass state forward to your agent on the next turn by assigning
any pickleable object to the self.context attribute.
**********************************************************************
"""
    def get_action(self, state):
        """ Employ an adversarial search technique to choose an action
        available in the current state

        This method must call self.queue.put(ACTION) at least once, and may
        call it as many times as you want; the caller will be responsible
        for cutting off the function after the search time limit has expired.

        See RandomPlayer and GreedyPlayer in sample_players for more examples.

        **********************************************************************
        NOTE:
        - The caller is responsible for cutting off search, so calling
          get_action() from your own code will create an infinite loop!
          Refer to (and use!) the Isolation.play() function to run games.
        **********************************************************************
        """
        logging.info("Move %s" % state.ply_count)
        # Queue a random fallback move immediately so an action is always available
        self.queue.put(random.choice(state.actions()))
        i = 1
        statlist = []
        # Keep running UCT iterations until just before the search timer expires
        while (self.queue._TimedQueue__stop_time - 0.05) > time.perf_counter():
            next_action = self.uct_search(state, statlist, i)
            self.queue.put(next_action)
            i += 1
    def uct_search(self, state, statlist, i):
        plyturn = state.ply_count % 2
        Stat = namedtuple('Stat', 'state action utility visit nround')

        def tree_policy(state):
            statecopy = deepcopy(state)
            while not statecopy.terminal_test():
                # All taken actions at this depth
                tried = [s.action for s in statlist if s.state == statecopy]
                # See if there's any untried actions left
                untried = [a for a in statecopy.actions() if a not in tried]

                topop = []
                toappend = []

                if len(untried) > 0:
                    next_action = random.choice(untried)
                    statecopy = expand(statecopy, next_action)
                    break
                else:
                    next_action = best_child(statecopy, 1)
                    for k, s in enumerate(statlist):
                        if s.state == statecopy and s.action == next_action:
                            visit1 = statlist[k].visit + 1
                            news = statlist[k]._replace(visit=visit1)
                            news = news._replace(nround=i)
                            topop.append(k)
                            toappend.append(news)
                            break
                    update_scores(topop, toappend)
                    statecopy = statecopy.result(next_action)
            return statecopy
        def expand(state, action):
            """
            Returns the state resulting from taking an action from the list of untried nodes
            """
            statlist.append(Stat(state, action, 0, 1, i))
            return state.result(action)

        def best_child(state, c):
            """
            Returns the best action from this state. c value between 0 (pure exploitation)
            and 1 (prioritize exploration)
            """
            # All taken actions at this depth
            tried = [s for s in statlist if s.state == state]

            maxscore = -999
            maxaction = []
            # Compute the score for every tried action and keep the best ones
            for t in tried:
                score = (t.utility / t.visit) + c * math.sqrt(2 * math.log(i) / t.visit)
                if score > maxscore:
                    maxscore = score
                    del maxaction[:]
                    maxaction.append(t.action)
                elif score == maxscore:
                    maxaction.append(t.action)

            if len(maxaction) < 1:
                logging.error("IndexError: maxaction is empty!")

            return random.choice(maxaction)
        def default_policy(state):
            """
            The simulation to run when visiting unexplored nodes. Defaults to uniform random moves
            """
            while not state.terminal_test():
                state = state.result(random.choice(state.actions()))

            # Map the +/- infinity terminal utility to a +/- 1 reward
            delta = state.utility(self.player_id)
            if abs(delta) == float('inf') and delta < 0:
                delta = -1
            elif abs(delta) == float('inf') and delta > 0:
                delta = 1
            return delta
        def backup_negamax(delta):
            """
            Propagates the terminal utility up the search tree
            """
            topop = []
            toappend = []
            for k, s in enumerate(statlist):
                if s.nround == i:
                    # Add the reward on the searching player's plies,
                    # subtract it on the opponent's plies
                    if s.state.ply_count % 2 == plyturn:
                        utility1 = s.utility + delta
                        news = statlist[k]._replace(utility=utility1)
                    elif s.state.ply_count % 2 != plyturn:
                        utility1 = s.utility - delta
                        news = statlist[k]._replace(utility=utility1)
                    topop.append(k)
                    toappend.append(news)

            update_scores(topop, toappend)
            return

        def update_scores(topop, toappend):
            # Remove outdated tuples. Order needs to be in reverse or pop will fail!
            for p in sorted(topop, reverse=True):
                statlist.pop(p)
            # Add the updated ones
            for a in toappend:
                statlist.append(a)
            return
        next_state = tree_policy(state)
        if not next_state.terminal_test():
            delta = default_policy(next_state)
            backup_negamax(delta)

        return best_child(state, 0)
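For reference, the score computed inside best_child is the UCB1 rule from the referenced paper. Writing Q for the accumulated utility and N for the visit count stored in each Stat tuple, with i the current iteration number and c the exploration constant passed in:

\[
\text{score}(s, a) = \frac{Q(s, a)}{N(s, a)} + c \sqrt{\frac{2 \ln i}{N(s, a)}}
\]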
The lack of colour formatting does make the code hard to read, so please feel free to check it out on my github. I have no problem running the game locally, and my MCTS agent finishes its matches against the minimax agent without any errors.
From my discussion with the course representatives, we believe the error is likely caused by the use of random.choice(). There are 4 instances where it is used in my implementation:

1. In get_action(), to queue a random fallback move before the search starts
2. In tree_policy(), to pick one of the untried actions
3. In best_child(), to break ties among the actions with the highest score
4. In default_policy(), to pick random moves during the simulated playout

I am assuming that the game implementation is correct and that, as long as a state is not terminal, calling state.actions() will always return a list of possible moves. Therefore, the only instance that could trigger this exception is item 3. Items 1 and 4 simply make a random selection from the available actions, and explicit checks are in place to make sure random.choice() is not fed an empty list. Hence I applied logging to item 3 (even though no exception is ever raised when running locally) and, sure enough, no exception was caught after 50 games.
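Just to illustrate what I mean by item 3, the kind of guard I experimented with looks roughly like this (a debugging sketch only, using the same math/random/logging imports as above; the stand-alone function name and the fallback to a random legal move are hypothetical and not part of my submitted code):

def guarded_best_child(state, c, statlist, i):
    # Stand-alone variant of best_child for debugging: log the failing position
    # and fall back to any legal move instead of letting random.choice() raise
    # IndexError on an empty list.
    tried = [s for s in statlist if s.state == state]
    maxscore = -999
    maxaction = []
    for t in tried:
        score = (t.utility / t.visit) + c * math.sqrt(2 * math.log(i) / t.visit)
        if score > maxscore:
            maxscore = score
            maxaction = [t.action]
        elif score == maxscore:
            maxaction.append(t.action)
    if not maxaction:
        logging.error("maxaction empty at ply %s (tried=%s)", state.ply_count, len(tried))
        return random.choice(state.actions())
    return random.choice(maxaction)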
I apologise for the lengthy post, but I really hope someone can catch whatever it is I am missing in my implementation.
I would suggest writing unit tests for every function you have. Verify your assumptions about how the code behaves, and try to test it comprehensively, including the corner cases. The mere need to feed in abstract test data often reveals a lot about the architecture and details of a solution.

In addition, you could try letting your agent play any other suitable game. These steps will give you a good chance of discovering any bugs in the code, and will make it better suited for reuse in the future.
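For example, if you pull the score calculation out of best_child into a small stand-alone helper, it becomes trivial to test. A minimal pytest-style sketch (the helper name and the numbers are made up for illustration, not taken from your code):

import math

def ucb1(utility, visits, total_rounds, c):
    # Exploitation term plus exploration bonus, as used in best_child.
    return utility / visits + c * math.sqrt(2 * math.log(total_rounds) / visits)

def test_ucb1_pure_exploitation():
    # With c = 0 the score is just the average utility.
    assert ucb1(utility=3.0, visits=2, total_rounds=10, c=0) == 1.5

def test_ucb1_prefers_less_visited_nodes():
    # With equal average utility, the node with fewer visits gets the larger bonus.
    rarely = ucb1(utility=1.0, visits=1, total_rounds=10, c=1)
    often = ucb1(utility=5.0, visits=5, total_rounds=10, c=1)
    assert rarely > often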