1. Cannot log in after uninstalling and reinstalling
Remember install.sh? I had disabled default-user creation there and also dropped the airflow_plus database. The setting below skips the job that creates the default user, which is why login fails. Set it to true and re-run the install.
# Disable the create-default-user job
--set webserver.defaultUser.enabled=false \
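Equivalently, the flag can live in values.yaml instead of being passed via --set. A sketch based on the official Airflow Helm chart's defaultUser block (the username/password shown are the chart defaults, adjust as needed):

```yaml
webserver:
  defaultUser:
    enabled: true
    role: Admin
    username: admin
    password: admin
```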
2. Worker restarts causing task failures
The cause was a version conflict among custom-installed dependencies. Checking the events of the restarted worker's pod showed the following:
File "/home/airflow/.local/lib/python3.7/site-packages/kombu/entity.py", line 7, in <module>
    from .serialization import prepare_accept_content
File "/home/airflow/.local/lib/python3.7/site-packages/kombu/serialization.py", line 440, in <module>
    for ep, args in entrypoints('kombu.serializers'):  # pragma: no cover
File "/home/airflow/.local/lib/python3.7/site-packages/kombu/utils/compat.py", line 82, in entrypoints
    for ep in importlib_metadata.entry_points().get(namespace, [])
AttributeError: 'EntryPoints' object has no attribute 'get'
Check the dependencies and pin the offending version:
importlib-metadata==4.13.0
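For background, the crash happens because newer importlib-metadata releases return an EntryPoints object without the dict-style .get() that old kombu still calls. A version-agnostic lookup (a sketch for illustration, not kombu's actual code) looks like:

```python
import importlib.metadata as md  # on Python 3.7, use the importlib_metadata backport instead


def get_entry_points(group):
    """Look up entry points across both old and new importlib(-)metadata APIs."""
    eps = md.entry_points()
    if hasattr(eps, "select"):
        # New EntryPoints API: filter by group via select()
        return list(eps.select(group=group))
    # Legacy API: entry_points() returned a dict of group -> entries,
    # which is the interface old kombu assumed
    return list(eps.get(group, []))
```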
3. Installing dependencies dynamically
- With a single worker node, you can run an install routine before each task. It works, but it is clumsy and not very practical:
import logging
import os

log = logging.getLogger(__name__)

def install():
    log.info("begin install requirements")
    os.system("pip install -r /opt/airflow/dags/repo/dags/requirements.txt")
    os.system("pip install -I my_utils")
    log.info("finish install requirements")
Import it at the top of the file executed by the DAG's first task, so the install output shows up in the WebServer log:
import sys
sys.path.insert(0, '/opt/airflow/dags/repo')
import dags.install as install
install.install()
- With multiple worker nodes, consecutive tasks in a run may land on different workers, so there is no clean solution; the simplest approach is to install the dependencies on each worker manually.
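One caveat about the single-worker install() above: os.system() ignores pip failures, so the task keeps running even when the install silently failed. A slightly more defensive variant (a sketch; the requirements path is the one used in this setup):

```python
import logging
import subprocess
import sys

log = logging.getLogger(__name__)


def install(requirements="/opt/airflow/dags/repo/dags/requirements.txt"):
    """Install dependencies before the first task, failing loudly on pip errors."""
    log.info("begin install requirements")
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", requirements],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the tail of pip's stderr so the task fails with a useful message
        raise RuntimeError("pip install failed: " + result.stderr[-500:])
    log.info("finish install requirements")
```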
4. Triggerer stuck in "starting", so scheduled runs never fire
{triggerer_job.py:101} INFO - Starting the triggerer
[2023-03-17T13:47:30.947+0000] {triggerer_job.py:348} ERROR - Triggerer's async thread was blocked for 0.36 seconds, likely by a badly-written trigger. Set PYTHONASYNCIODEBUG=1 to get more information on overrunning coroutines.
[2023-03-17T22:15:17.394+0000] {triggerer_job.py:348} ERROR - Triggerer's async thread was blocked for 0.27 seconds, likely by a badly-written trigger. Set PYTHONASYNCIODEBUG=1 to get more information on overrunning coroutines.
This appears to be a version issue. I changed the version to 2.2.1, cleared the database, uninstalled, and reinstalled, and it worked (this one took a long time to track down). Note: set the version in values.yaml and remove it from install.sh, otherwise it does not take effect.
airflowVersion: 2.2.1
defaultAirflowTag: 2.2.1
config:
  core:
    dags_folder: /opt/airflow/dags/repo/dags
    hostname_callable: airflow.utils.net.get_host_ip_address
5. Ignoring files with .airflowignore
The .airflowignore file is used to skip non-DAG files during DAG parsing. It goes under dags_folder, which in my values.yaml is /opt/airflow/dags/repo/dags, so that is where I placed it.
For example, to ignore the non-DAG .py files under merge:
jh/merge/*
With this pattern in place, a file named merge_dag.py directly under the jh directory also disappears from the WebServer, because the pattern is treated as a regex and matches the jh/merge prefix. We solved this by uniformly renaming DAG files with a dag_ prefix.
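The surprise can be reproduced with plain re: in this Airflow version, .airflowignore patterns are regexes applied with search, so jh/merge/* also hits jh/merge_dag.py (the file names here are illustrative):

```python
import re

# .airflowignore patterns in this Airflow version are regexes applied with re.search;
# "jh/merge/*" means "jh/merge" followed by zero or more "/" characters
pattern = re.compile("jh/merge/*")

files = ["jh/merge/util.py", "jh/merge_dag.py", "jh/dag_merge.py"]
ignored = [f for f in files if pattern.search(f)]
# jh/merge_dag.py is ignored too, which is why it vanished from the WebServer;
# jh/dag_merge.py survives, hence the dag_ prefix convention
```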
6. Mount issues
I tried mounting a CephFS directory into the worker nodes. The mount itself succeeded, but the worker kept blocking, so I gave up on writing intermediate tables as CSV files and switched to writing them into ClickHouse instead.
7. Scheduler 401 permission problem
Make sure the Scheduler deployment's executor label is CeleryKubernetesExecutor; any other value triggers this problem.
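In the official Airflow Helm chart the executor is a top-level value; a minimal values.yaml fragment matching the setup above:

```yaml
executor: CeleryKubernetesExecutor
```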
8. Scheduler and Triggerer health-check problem
Liveness probe failed: No alive jobs found
Modify the liveness-probe configuration in values.yaml and reinstall:
scheduler:
  livenessProbe:
    command: ["bash", "-c", "airflow jobs check --job-type SchedulerJob --allow-multiple --limit 100"]
triggerer:
  livenessProbe:
    command: ["bash", "-c", "airflow jobs check --job-type TriggererJob --allow-multiple --limit 100"]
Hope this helps. Feel free to follow the WeChat official account 算法小生.