Preface
Recently I wanted to run game-based parameter tuning for my own engine and needed an offline testing platform. I found Fishtest, written in Python, which is quite good, but it needs extensive changes: it runs its games online and is tied to GitHub, so the code has to be modified.
There are quite a few pitfalls, and some are not fully written up here.
The engine must have a Tune module, since tuning works by modifying its parameters.
Using the platform requires some knowledge of SPRT, SPSA, and mathematical statistics.
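As a quick illustration of the SPRT side: the test keeps playing games and updating a log-likelihood ratio (LLR) between two Elo hypotheses until it crosses an acceptance bound. Below is a minimal sketch using the simplified trinomial (normal) approximation common in engine testing; the function names are mine, not fishtest's:

```python
import math

def elo_to_score(elo):
    # Expected score for an Elo advantage under the logistic model.
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def llr(wins, draws, losses, elo0, elo1):
    # Simplified log-likelihood ratio of H1 (elo = elo1) vs H0 (elo = elo0)
    # from (win, draw, loss) counts; an approximation, not the exact test.
    n = wins + draws + losses
    if wins == 0 or losses == 0:
        return 0.0
    p_w, p_d = wins / n, draws / n
    s = p_w + 0.5 * p_d                  # observed mean score per game
    var = p_w + 0.25 * p_d - s * s       # per-game score variance
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2 * s - s0 - s1) / (2 * var)

def sprt_bounds(alpha=0.05, beta=0.05):
    # Accept H0 below the lower bound, accept H1 above the upper bound.
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)
```

With alpha = beta = 0.05 the bounds are roughly (-2.94, +2.94); the run stops as soon as the LLR leaves that interval.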
Deployment
Server
- Use Ubuntu 18.04 as the operating system (the Server edition is recommended).
- Copy the script setup-fishtest.sh:
  - Change the user_pwd variable to your password.
  - Change the server_name variable to your domain (optional, needed if you use HTTPS).
- Run the script with the following command:
sudo bash setup-fishtest.sh 2>&1 | tee setup-fishtest.sh.log
setup-fishtest.sh
#!/bin/bash
# 201025
# to setup a fishtest server on Ubuntu 18.04 (bionic), simply run:
# sudo bash setup_fishtest.sh 2>&1 | tee setup_fishtest.sh.log
#
# to use fishtest connect a browser to:
# http://<ip_address> or http://<fully_qualified_domain_name>
user_name='fishtest'
user_pwd='<your_password>'
server_name=$(hostname --all-ip-addresses)
# use a fully qualified domain names (http/https)
# server_name='<fully_qualified_domain_name>'
git_user_name='your_name'
git_user_email='[email protected]'
# create user for fishtest
useradd -m -s /bin/bash ${user_name}
echo ${user_name}:${user_pwd} | chpasswd
usermod -aG sudo ${user_name}
sudo -i -u ${user_name} << EOF
mkdir .ssh
chmod 700 .ssh
touch .ssh/authorized_keys
chmod 600 .ssh/authorized_keys
EOF
# get the user $HOME
user_home=$(sudo -i -u ${user_name} << 'EOF'
echo ${HOME}
EOF
)
# add some bash variables
sudo -i -u ${user_name} << 'EOF'
cat << 'EOF0' >> .profile
export FISHTEST_HOST=127.0.0.1
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export VENV="$HOME/fishtest/server/env"
EOF0
EOF
# set secrets
sudo -i -u ${user_name} << EOF
echo '' > fishtest.secret
echo '' > fishtest.captcha.secret
echo '' > fishtest.upload
cat << EOF0 > .netrc
# GitHub authentication to raise API rate limit
# create a <personal-access-token> https://github.com/settings/tokens
#machine api.github.com
#login <personal-access-token>
#password x-oauth-basic
EOF0
chmod 600 .netrc
EOF
# install required packages
apt update && apt full-upgrade -y && apt autoremove -y && apt clean
apt purge -y apache2 apache2-data apache2-doc apache2-utils apache2-bin
apt install -y ufw git bash-completion nginx mutt curl procps
# configure ufw
ufw allow ssh
ufw allow http
ufw allow https
ufw allow 6542
ufw --force enable
ufw status verbose
# configure nginx
cat << EOF > /etc/nginx/sites-available/fishtest.conf
upstream backend_tests {
  server 127.0.0.1:6543;
}
upstream backend_all {
  server 127.0.0.1:6544;
}
server {
  listen 80;
  listen [::]:80;
  server_name ${server_name};
  location ~ ^/(css|html|img|js|favicon.ico|robots.txt) {
    root ${user_home}/fishtest/server/fishtest/static;
    expires 1y;
    add_header Cache-Control public;
    access_log off;
  }
  location / {
    proxy_pass http://backend_all;
    proxy_set_header X-Forwarded-Proto \$scheme;
    proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Host \$host:\$server_port;
    proxy_set_header X-Forwarded-Port \$server_port;
    client_max_body_size 100m;
    client_body_buffer_size 128k;
    proxy_connect_timeout 60s;
    proxy_send_timeout 90s;
    proxy_read_timeout 90s;
    proxy_buffering off;
    proxy_temp_file_write_size 64k;
    proxy_redirect off;
    location ~ ^/api/(active_runs|download_pgn|download_pgn_100|request_version|upload_pgn) {
      proxy_pass http://backend_all;
    }
    location /api/ {
      proxy_pass http://backend_tests;
    }
    location ~ ^/tests/(finished|user/) {
      proxy_pass http://backend_all;
    }
    location /tests {
      proxy_pass http://backend_tests;
    }
  }
}
EOF
unlink /etc/nginx/sites-enabled/default
ln -sf /etc/nginx/sites-available/fishtest.conf /etc/nginx/sites-enabled/fishtest.conf
systemctl enable nginx.service
systemctl restart nginx.service
# setup pyenv and install the latest python version
# https://github.com/pyenv/pyenv
apt update
apt install -y --no-install-recommends make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
sudo -i -u ${user_name} << 'EOF'
cat << 'EOF0' >> .profile
# pyenv: keep at the end of the file
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
if command -v pyenv &>/dev/null; then
eval "$(pyenv init -)"
fi
EOF0
EOF
sudo -i -u ${user_name} << 'EOF'
python_ver="3.8.6"
git clone https://github.com/pyenv/pyenv.git "${PYENV_ROOT}"
pyenv install ${python_ver}
pyenv global ${python_ver}
EOF
# install mongodb community edition for Ubuntu 18.04 (bionic), change for other releases
wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | sudo apt-key add -
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.4.list
apt update
apt install -y mongodb-org
# set the cache size in /etc/mongod.conf
#  wiredTiger:
#    engineConfig:
#      cacheSizeGB: 1.75
cp /etc/mongod.conf mongod.conf.bkp
sed -i 's/^#  wiredTiger:/  wiredTiger:\n    engineConfig:\n      cacheSizeGB: 1.75/' /etc/mongod.conf
# setup logrotate for mongodb
sed -i '/^  logAppend: true/a\  logRotate: reopen' /etc/mongod.conf
cat << 'EOF' > /etc/logrotate.d/mongod
/var/log/mongodb/mongod.log
{
  daily
  missingok
  rotate 14
  compress
  delaycompress
  notifempty
  create 0600 mongodb mongodb
  sharedscripts
  postrotate
    /bin/kill -SIGUSR1 $(pgrep mongod 2>/dev/null) 2>/dev/null || true
  endscript
}
EOF
# download fishtest
sudo -i -u ${user_name} << EOF
git clone --single-branch --branch master https://github.com/glinscott/fishtest.git
cd fishtest
git config user.email "${git_user_email}"
git config user.name "${git_user_name}"
EOF
# setup fishtest
sudo -i -u ${user_name} << 'EOF'
python3 -m venv ${VENV}
${VENV}/bin/python3 -m pip install --upgrade pip setuptools wheel
cd ${HOME}/fishtest/server
${VENV}/bin/python3 -m pip install -e .
EOF
# install fishtest as systemd service
cat << EOF > /etc/systemd/system/[email protected]
[Unit]
Description=Fishtest Server port %i
After=network.target mongod.service
[Service]
Type=simple
ExecStart=${user_home}/fishtest/server/env/bin/pserve production.ini http_port=%i
Restart=on-failure
RestartSec=3
User=${user_name}
WorkingDirectory=${user_home}/fishtest/server
[Install]
WantedBy=multi-user.target
EOF
# install also fishtest debug as systemd service
cat << EOF > /etc/systemd/system/fishtest_dbg.service
[Unit]
Description=Fishtest Server Debug port 6542
After=network.target mongod.service
[Service]
Type=simple
ExecStart=${user_home}/fishtest/server/env/bin/pserve development.ini --reload
User=${user_name}
WorkingDirectory=${user_home}/fishtest/server
[Install]
WantedBy=multi-user.target
EOF
# enable the autostart for mongod.service and [email protected]
# check the log with: sudo journalctl -u [email protected]
systemctl daemon-reload
systemctl enable mongod.service
systemctl enable fishtest@{6543..6544}.service
# start fishtest server
systemctl start mongod.service
systemctl start fishtest@{6543..6544}.service
# add mongodb indexes
sudo -i -u ${user_name} << 'EOF'
${VENV}/bin/python3 ${HOME}/fishtest/server/utils/create_indexes.py actions flag_cache pgns runs users
EOF
# add some default users:
# "user00" (with password "user00"), as approver
# "user01" (with password "user01"), as normal user
sudo -i -u ${user_name} << 'EOF'
${VENV}/bin/python3 << EOF0
from fishtest.rundb import RunDb
rdb = RunDb()
rdb.userdb.create_user('user00', 'user00', '[email protected]')
rdb.userdb.add_user_group('user00', 'group:approvers')
user = rdb.userdb.get_user('user00')
user['blocked'] = False
user['machine_limit'] = 100
rdb.userdb.save_user(user)
rdb.userdb.create_user('user01', 'user01','[email protected]')
user = rdb.userdb.get_user('user01')
user['blocked'] = False
user['machine_limit'] = 100
rdb.userdb.save_user(user)
EOF0
EOF
sudo -i -u ${user_name} << 'EOF'
(crontab -l; cat << EOF0
VENV=${HOME}/fishtest/server/env
UPATH=${HOME}/fishtest/server/utils
# Backup mongodb database and upload to s3
# keep disabled on dev server
# 3 */6 * * * /usr/bin/cpulimit -l 50 -f -m -- sh ${UPATH}/backup.sh
# Update the users table
1,16,31,46 * * * * \${VENV}/bin/python3 \${UPATH}/delta_update_users.py
# Purge old pgn files
33 3 * * * \${VENV}/bin/python3 \${UPATH}/purge_pgn.py
# Clean up old mail (more than 9 days old)
33 5 * * * screen -D -m mutt -e 'push D~d>9d<enter>qy<enter>'
EOF0
) | crontab -
EOF
cat << EOF
connect a browser to:
http://${server_name}
EOF
A screenshot of the result, with one finished test in it:
Client
The server repository actually includes the worker (client) code, but it takes some fiddling to get running (runtime, build environment, and so on); there is a ready-made portable build (for Windows) by another developer.
The match runner has to be implemented yourself, which is another pitfall: it means modifying a lot of C++ code. You can base it on the open-source cutechess project; roughly ten thousand lines of code are omitted here.
Converting to offline use
Logic differences
Online use:
- Creating a task: queries GitHub for information such as the Bench value and commit log.
- Running a task: downloads the sources, opening book, network weights, match runner, and so on, compiles them locally, then runs.
Offline use:
- Creating a task: generate the information (sha1) yourself and ignore the Bench value.
- Running a task: manually copy in the engines to be tested, the match runner, etc.; no compilation, just run them directly.
Overall, the idea is to remove all the online code, for example everything that accesses GitHub.
Drawback: it seems only one task can run at a time. There is a workaround: copy several engines into the match-runner directory, compute their sha1 values in advance (renaming them with the sha1 as a suffix), and deploy several separate clients.
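The sha1-and-rename step above can be sketched as follows. This is my own helper, not part of fishtest; the `chameleon_` prefix and the empty executable suffix are assumptions taken from the games.py naming (`"chameleon_" + sha`):

```python
import hashlib
import os
import shutil

def sha1_of(path, chunk_size=1 << 20):
    # Hash the engine binary in chunks so large files do not load into RAM.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def stage_engine(src, testing_dir, prefix="chameleon_", exe_suffix=""):
    # Copy an engine into the match-runner directory under the name the
    # worker expects: <prefix><sha1><suffix>. Returns the sha1 so it can
    # be entered when creating the task on the server.
    sha = sha1_of(src)
    dst = os.path.join(testing_dir, prefix + sha + exe_suffix)
    shutil.copy2(src, dst)
    return sha, dst
```

Run it once per engine build, and use the returned sha1 when filling in the task on the server side.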
Modifying the code
Server
Start with the front end: download the files it depends on and serve them offline, otherwise pages hang forever.
I downloaded 3 files in total, bootstrap.min.js, bootstrap-combined.min.css, and jquery-3.5.1.min.js, and placed them under the following paths, separated by file type.
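A small helper for that step; the css/ and js/ split under server/fishtest/static matches the static paths the nginx config serves, and the CDN URLs below are only examples of where the files can be fetched from, not the author's sources:

```python
import os
import urllib.request

# Example sources only; any mirror of the same files will do.
ASSETS = {
    "jquery-3.5.1.min.js": "https://code.jquery.com/jquery-3.5.1.min.js",
    "bootstrap.min.js": "https://netdna.bootstrapcdn.com/twitter-bootstrap/2.3.2/js/bootstrap.min.js",
    "bootstrap-combined.min.css": "https://netdna.bootstrapcdn.com/twitter-bootstrap/2.3.2/css/bootstrap-combined.min.css",
}

def asset_path(static_root, filename):
    # Place files by type: *.css under css/, scripts under js/.
    subdir = "css" if filename.endswith(".css") else "js"
    return os.path.join(static_root, subdir, filename)

def fetch_assets(static_root):
    # Download each asset into the static tree so the pages work offline.
    for name, url in ASSETS.items():
        dst = asset_path(static_root, name)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        urllib.request.urlretrieve(url, dst)
```

Remember to also change the templates to reference the local copies instead of the CDN URLs.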
There is no need to query IP address details online, so comment out the following code.
api.py
def get_flag(self):
    # ip = self.request.remote_addr
    # if ip in flag_cache:
    #     return flag_cache.get(ip, None)  # Handle race condition on "del"
    # # concurrent invocations get None, race condition is not an issue
    # flag_cache[ip] = None
    # result = self.request.userdb.flag_cache.find_one({"ip": ip})
    # if result:
    #     flag_cache[ip] = result["country_code"]
    #     return result["country_code"]
    # try:
    #     # Get country flag from worker IP address
    #     FLAG_HOST = "https://freegeoip.app/json/"
    #     r = requests.get(FLAG_HOST + self.request.remote_addr, timeout=1.0)
    #     if r.status_code == 200:
    #         country_code = r.json()["country_code"]
    #         self.request.userdb.flag_cache.insert_one(
    #             {
    #                 "ip": ip,
    #                 "country_code": country_code,
    #                 "geoip_checked_at": datetime.utcnow(),
    #             }
    #         )
    #         flag_cache[ip] = country_code
    #         return country_code
    #     raise Error("flag server failed")
    # except:
    #     del flag_cache[ip]
    #     print("Failed GeoIP check for {}".format(ip))
    return None
When a new test task is created, GitHub is queried for information; this has to be removed. The changed places are shown below.
worker.py
def worker(worker_info, password, remote):
    global ALIVE, FLEET
    payload = {"worker_info": worker_info, "password": password}
    try:
        print("Fetch task...")
        # if not get_rate():
        #     raise Exception("Near API limit")
Comment out the two lines starting at `if not get_rate()`; they check whether GitHub API usage is near its limit.
games.py
# create new engines
sha_new = run["args"]["resolved_new"]
sha_base = run["args"]["resolved_base"]
new_engine_name = "chameleon_" + sha_new
base_engine_name = "chameleon_" + sha_base
new_engine = os.path.join(testing_dir, new_engine_name + EXE_SUFFIX)
base_engine = os.path.join(testing_dir, base_engine_name + EXE_SUFFIX)
sylvan = os.path.join(testing_dir, "sylvan-cli" + EXE_SUFFIX)
print("new_engine_name " + str(new_engine_name))
print("base_engine_name " + str(base_engine_name))
# Build from sources new and base engines as needed
# if not os.path.exists(new_engine):
#     setup_engine(
#         new_engine,
#         worker_dir,
#         testing_dir,
#         remote,
#         sha_new,
#         repo_url,
#         worker_info["concurrency"],
#     )
# if not os.path.exists(base_engine):
#     setup_engine(
#         base_engine,
#         worker_dir,
#         testing_dir,
#         remote,
#         sha_base,
#         repo_url,
#         worker_info["concurrency"],
#     )
os.chdir(testing_dir)
# Download book if not already existing
# if (
#     not os.path.exists(os.path.join(testing_dir, book))
#     or os.stat(os.path.join(testing_dir, book)).st_size == 0
# ):
#     zipball = book + ".zip"
#     setup(zipball, testing_dir)
#     zip_file = ZipFile(zipball)
#     zip_file.extractall()
#     zip_file.close()
#     os.remove(zipball)
# Download sylvan if not already existing
# if not os.path.exists(sylvan):
#     if len(EXE_SUFFIX) > 0:
#         zipball = "sylvan-cli-win.zip"
#     else:
#         zipball = "sylvan-cli-linux-{}.zip".format(platform.architecture()[0])
#     setup(zipball, testing_dir)
#     zip_file = ZipFile(zipball)
#     zip_file.extractall()
#     zip_file.close()
#     os.remove(zipball)
#     os.chmod(sylvan, os.stat(sylvan).st_mode | stat.S_IEXEC)
# verify that an available sylvan matches the required minimum version
# verify_required_sylvan(sylvan)
# clean up old networks (keeping the 10 most recent)
networks = glob.glob(os.path.join(testing_dir, "nn-*.nnue"))
if len(networks) > 10:
    networks.sort(key=os.path.getmtime)
    for old_net in networks[:-10]:
        try:
            os.remove(old_net)
        except:
            print("Failed to remove an old network " + str(old_net))
# Add EvalFile with full path to sylvan options, and download networks if not already existing
# net_base = required_net(base_engine)
# if net_base:
#     base_options = base_options + [
#         "option.EvalFile={}".format(os.path.join(testing_dir, net_base))
#     ]
# net_new = required_net(new_engine)
# if net_new:
#     new_options = new_options + [
#         "option.EvalFile={}".format(os.path.join(testing_dir, net_new))
#     ]
# for net in [net_base, net_new]:
#     if net:
#         if not os.path.exists(os.path.join(testing_dir, net)) or not validate_net(
#             testing_dir, net
#         ):
#             download_net(remote, testing_dir, net)
#             if not validate_net(testing_dir, net):
#                 raise Exception("Failed to validate the network: {}".format(net))
# pgn output setup
pgn_name = "results-" + worker_info["unique_key"] + ".pgn"
if os.path.exists(pgn_name):
    os.remove(pgn_name)
pgnfile = os.path.join(testing_dir, pgn_name)
# Verify signatures are correct
verify_signature(
    new_engine,
    run["args"]["new_signature"],
    remote,
    result,
    games_concurrency * threads,
)
base_nps = verify_signature(
    base_engine,
    run["args"]["base_signature"],
    remote,
    result,
    games_concurrency * threads,
)
Comment out the following places:
- where the sources are downloaded online;
- where the opening book is downloaded (we don't have one for now, and don't need it);
- everywhere the engines are verified.
# Limit worker Github API calls
# if "rate" in worker_info:
#     rate = worker_info["rate"]
#     limit = rate["remaining"] <= 2 * math.sqrt(rate["limit"])
# else:
limit = False
Comment out that rate limit as shown above, otherwise the server reports an error.
Client
worker.py
if cpu_count <= 0:
    sys.stderr.write("Not enough CPUs to run fishtest (it requires at least two)\n")
    worker_exit()
""" try:
    gcc_version()
except Exception as e:
    print(e, file=sys.stderr)
    worker_exit() """
with open(config_file, "w") as f:
    config.write(f)
if options.only_config == "True":
    worker_exit(0)
Comment out the places that need the gcc compiler; as mentioned earlier, offline use involves no compilation, which would be too much hassle.
worker.py
def worker(worker_info, password, remote):
    global ALIVE, FLEET
    payload = {"worker_info": worker_info, "password": password}
    try:
        print("Fetch task...")
        # if not get_rate():
        #     raise Exception("Near API limit")
Comment out the two lines starting at `if not get_rate()`; they check whether GitHub API usage is near its limit.
games.py
# if int(bench_sig) != int(signature):
#     message = "Wrong bench in {} Expected: {} Got: {}".format(
#         os.path.basename(engine),
#         signature,
#         bench_sig,
#     )
#     payload["message"] = message
#     send_api_post_request(remote + "/api/stop_run", payload)
#     raise Exception(message)
Comment out the block starting at `if int(bench_sig) != int(signature):`; offline use doesn't need engine verification.
games.py
print("new_engine_name " + str(new_engine_name))
print("base_engine_name " + str(base_engine_name))
# Build from sources new and base engines as needed
# if not os.path.exists(new_engine):
#     setup_engine(
#         new_engine,
#         worker_dir,
#         testing_dir,
#         remote,
#         sha_new,
#         repo_url,
#         worker_info["concurrency"],
#     )
# if not os.path.exists(base_engine):
#     setup_engine(
#         base_engine,
#         worker_dir,
#         testing_dir,
#         remote,
#         sha_base,
#         repo_url,
#         worker_info["concurrency"],
#     )
os.chdir(testing_dir)
# Download book if not already existing
# if (
#     not os.path.exists(os.path.join(testing_dir, book))
#     or os.stat(os.path.join(testing_dir, book)).st_size == 0
# ):
#     zipball = book + ".zip"
#     setup(zipball, testing_dir)
#     zip_file = ZipFile(zipball)
#     zip_file.extractall()
#     zip_file.close()
#     os.remove(zipball)
# Download sylvan if not already existing
# if not os.path.exists(sylvan):
#     if len(EXE_SUFFIX) > 0:
#         zipball = "sylvan-cli-win.zip"
#     else:
#         zipball = "sylvan-cli-linux-{}.zip".format(platform.architecture()[0])
#     setup(zipball, testing_dir)
#     zip_file = ZipFile(zipball)
#     zip_file.extractall()
#     zip_file.close()
#     os.remove(zipball)
#     os.chmod(sylvan, os.stat(sylvan).st_mode | stat.S_IEXEC)
# verify that an available sylvan matches the required minimum version
# verify_required_sylvan(sylvan)
# clean up old networks (keeping the 10 most recent)
networks = glob.glob(os.path.join(testing_dir, "nn-*.nnue"))
if len(networks) > 10:
    networks.sort(key=os.path.getmtime)
    for old_net in networks[:-10]:
        try:
            os.remove(old_net)
        except:
            print("Failed to remove an old network " + str(old_net))
# Add EvalFile with full path to sylvan options, and download networks if not already existing
# net_base = required_net(base_engine)
# if net_base:
#     base_options = base_options + [
#         "option.EvalFile={}".format(os.path.join(testing_dir, net_base))
#     ]
# net_new = required_net(new_engine)
# if net_new:
#     new_options = new_options + [
#         "option.EvalFile={}".format(os.path.join(testing_dir, net_new))
#     ]
# for net in [net_base, net_new]:
#     if net:
#         if not os.path.exists(os.path.join(testing_dir, net)) or not validate_net(
#             testing_dir, net
#         ):
#             download_net(remote, testing_dir, net)
#             if not validate_net(testing_dir, net):
#                 raise Exception("Failed to validate the network: {}".format(net))
Added the two print statements to make it easier to rename engines with the sha1 suffix.
Comment out the following places:
- where the sources are downloaded online;
- where the opening book is downloaded (we don't have one for now, and don't need it);
- everywhere the engines are verified.
Pitfalls
- Offline clients can't connect; the server reports an error.
Opening http://192.168.90.128/api/request_version shows the following:
Internal Server Error
The server encountered an unexpected internal server error
(generated by waitress)
At first I assumed the offline changes had broken something, so I went straight to the code: the error came from api.py, and then from rundb.py; my first reaction was to blame the upstream code again.
As mentioned above, commenting out that rate limit fixes it; if the error persists, open port 6542 (the debug service) to see the cause.
- Accounts get corrupted or lost for no obvious reason.
I haven't looked into it deeply; the tasks are all still there, so to keep working I simply recreate the accounts.
Here are a few handy mongo shell scripts:
db.users.update({'username': 'user00'}, {$set: {'password': 'user00'}}, {multi: true})
db.users.update({'username': 'user01'}, {$set: {'password': 'user01'}}, {multi: true})
db.getCollection("users").insert({
    username: "user00",
    password: "user00",
    "registration_time": ISODate("2020-02-15T06:47:50.853Z"),
    blocked: false,
    email: "[email protected]",
    groups: [
        "group:approvers"
    ],
    "tests_repo": "",
    "machine_limit": NumberInt("100")
})
db.users.update({"_id": ObjectId("5ea7fc06b8fbd777f6352fc3")}, {$set: {"password": "user00"}})
db.users.remove({"_id": ObjectId("5ea2dcacf911ef92007f2e1d")})
Replace the _id values with your own.
The passwords created here are 5 characters, but the front end requires at least 8, so they can only be set through these commands.
- The parameter trend graph is stuck at "Loading graph…".
This needs to reach Google's servers, and Google explicitly forbids offline use of the charting loader, so there is no workaround for now (it isn't very useful anyway). My approach was to remove the widget from the front end.
- Tuning is too slow.
See: the practical guidelines on SPSA tuning.
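For reference, the core loop those SPSA guidelines discuss is small. Below is a generic sketch (coefficient names and defaults are the textbook ones, not fishtest's implementation); in engine tuning, `loss` would be replaced by the (noisy) result of a mini-match between the two perturbed parameter sets:

```python
import random

def spsa_minimize(loss, theta, iters=100, a=1.0, c=0.1, A=10.0, alpha=0.602, gamma=0.101):
    # Simultaneous Perturbation Stochastic Approximation: estimate the
    # gradient from only two loss evaluations per step, regardless of the
    # number of parameters being tuned.
    theta = list(theta)
    for k in range(1, iters + 1):
        ak = a / (k + A) ** alpha          # decaying step size
        ck = c / k ** gamma                # decaying perturbation size
        delta = [random.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + ck * d for t, d in zip(theta, delta)]
        minus = [t - ck * d for t, d in zip(theta, delta)]
        ghat = (loss(plus) - loss(minus)) / (2.0 * ck)
        # per-component gradient estimate is ghat / delta_i; delta_i is +-1,
        # so dividing equals multiplying
        theta = [t - ak * ghat * d for t, d in zip(theta, delta)]
    return theta
```

The tuning speed complaint comes from the fact that each iteration costs a pair of game batches; the gain/perturbation schedules (a, c, A) are exactly what the guidelines tell you how to pick.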