Question:

Getting a large number of decision task timeouts after scaling Uber Cadence's matching service in a Docker Swarm cluster

胡致远
2023-03-14

I'm trying to run each Cadence service independently so that I can scale them easily. My team uses Docker Swarm, and we manage everything through the Portainer UI. So far I've been able to scale the frontend service to two replicas, but if I do the same with the matching service, workflow executions hit a large number of decision task timeouts. The executions do eventually complete successfully, but they take a long time. To give you an idea: with two matching service replicas an execution takes 2 minutes, while with just one it takes only 7 seconds.

This is a test environment. I'm using a dockerized Cassandra DB (due to some budget constraints we can't use a real Cassandra cluster), so maybe that's the problem? The Docker image is configured with the following environment variables:

RINGPOP_BOOTSTRAP_MODE=dns
KEYSPACE=cadence
BIND_ON_IP=0.0.0.0
SKIP_SCHEMA_SETUP=false
VISIBILITY_KEYSPACE=cadence_visibility
CASSANDRA_HOSTNAME=soap_cassandra
RINGPOP_SEEDS=soap_cadence_frontend:7933,soap_cadence_history:7934,soap_cadence_worker:7939
CADENCE_HOME=/etc/cadence
SERVICES=matching

You can assume default values for any other env vars not shown above.

RINGPOP_SEEDS contains the service name assigned to each Cadence service; when multiple replicas are declared, Docker Swarm creates a DNS entry for each name along with a load balancer.
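For reference, here is a minimal sketch of how the matching service might be declared in the Swarm stack file (compose v3 syntax); the service and network names are assumptions inferred from the "soap_" prefix above:

version: "3.7"
services:
  cadence_matching:
    image: ubercadence/server:0.15.1
    environment:
      - SERVICES=matching
      - BIND_ON_IP=0.0.0.0
      - RINGPOP_BOOTSTRAP_MODE=dns
      - RINGPOP_SEEDS=soap_cadence_frontend:7933,soap_cadence_history:7934,soap_cadence_worker:7939
      - CASSANDRA_HOSTNAME=soap_cassandra
      - KEYSPACE=cadence
      - VISIBILITY_KEYSPACE=cadence_visibility
      - SKIP_SCHEMA_SETUP=false
      - CADENCE_HOME=/etc/cadence
    deploy:
      replicas: 2   # scaling this to 2 is what triggers the timeouts
    networks:
      - cadence_net
networks:
  cadence_net:
    driver: overlay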

The matching service seems to start up correctly. Logs:

{"level":"info","ts":"2021-02-18T22:47:36.296Z","msg":"Created RPC dispatcher and listening","service":"cadence-matching","address":"0.0.0.0:7935","logging-call-at":"rpc.go:81"},
{"level":"warn","ts":"2021-02-18T22:47:36.321Z","msg":"Failed to fetch key from dynamic config","key":"system.advancedVisibilityWritingMode","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.336Z","msg":"Add new peers by DNS lookup","address":"0.0.0.0","addresses":"[0.0.0.0:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.321Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"0.0.0.0:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Starting service matching","logging-call-at":"server.go:217"},
{"level":"warn","ts":"2021-02-18T22:47:36.441Z","msg":"Failed to fetch key from dynamic config","key":"matching.throttledLogRPS","error":"unable to find key","logging-call-at":"config.go:68"},
{"level":"info","ts":"2021-02-18T22:47:36.441Z","msg":"Creating RPC dispatcher outbound","service":"cadence-frontend","address":"127.0.0.1:7933","logging-call-at":"clientBean.go:277"},
{"level":"info","ts":"2021-02-18T22:47:36.442Z","msg":"Add new peers by DNS lookup","address":"127.0.0.1","addresses":"[127.0.0.1:7933]","logging-call-at":"clientBean.go:321"},
{"level":"info","ts":"2021-02-18T22:47:36.713Z","msg":"matching starting","service":"cadence-matching","logging-call-at":"service.go:90"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"RuntimeMetricsReporter started","service":"cadence-matching","logging-call-at":"runtime.go:169"},
{"level":"info","ts":"2021-02-18T22:47:36.734Z","msg":"PProf not started due to port not set","logging-call-at":"pprof.go:64"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-matching","addresses":"[[::]:7935]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-worker","addresses":"[[::]:7939]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.800Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-frontend","addresses":"[[::]:7933]","logging-call-at":"rpServiceResolver.go:246"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"service started","service":"cadence-matching","logging-call-at":"resourceImpl.go:383"},
{"level":"info","ts":"2021-02-18T22:47:36.814Z","msg":"matching started","service":"cadence-matching","logging-call-at":"service.go:99"}

While executing a workflow, I can see the following errors in the logs:

{"level":"error","ts":"2021-02-18T22:17:07.281Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"ae85d0ac1629:f8102a0f-406a-4fc7-8abf-e4b3fd66a278","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:52:03.740Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 16, db rangeID: 17","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1500002,"next-number":1500002,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:10:10.971Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"store-operation":"create-task","error":"Failed to create task. TaskList: FeaTaskList, taskListType: 1, rangeID: 94, db rangeID: 95","wf-task-list-name":"FeaTaskList","wf-task-list-type":1,"number":9300001,"next-number":9300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:09:53.345Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 14, db rangeID: 15","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1300001,"next-number":1300001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"},
{"level":"error","ts":"2021-02-18T22:53:56.145Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"store-operation":"create-task","error":"Failed to create task. TaskList: 8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca, taskListType: 0, rangeID: 17, db rangeID: 18","wf-task-list-name":"8dd84fa9834d:258a1229-bdfd-4ef3-b315-ffbf749221ca","wf-task-list-type":0,"number":1600001,"next-number":1600001,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}

The Docker image version I'm currently using is ubercadence/server:0.15.1.

Is there any way to fix this?

1 answer

杨经武
2023-03-14

My best guess is that the problem is BIND_ON_IP=0.0.0.0. Each instance should use a unique host IP:port as its address. Since they all use 0.0.0.0, each service only works when running with a single instance; with more than one instance the addresses collide.

However, this is not a problem for the frontend service, because the FE is stateless. Matching/History will run into this problem:

HostA registers itself with the matching service ring as 0.0.0.0:7935, and then HostB tries to do the same. This makes the consistent hash ring unstable, and task list ownership keeps flipping between HostA and HostB.

To fix this, you need each instance to use its own host IP, just like using the pod IP in K8s; see the sketch below.
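Docker Swarm has no direct equivalent of the Kubernetes pod-IP downward API, but one possible workaround is to resolve the container's own IP at startup and export it as BIND_ON_IP before launching the server. A rough sketch, assuming the image's startup script is /start.sh (verify this against the ubercadence/server image you use, and note that hostname -i can return several IPs when the task is attached to multiple overlay networks):

services:
  cadence_matching:
    image: ubercadence/server:0.15.1
    # Bind on this task's own IP instead of 0.0.0.0.
    # "$$" escapes "$" in compose files; awk keeps the first IP if several come back.
    entrypoint: ["/bin/sh", "-c", "export BIND_ON_IP=$$(hostname -i | awk '{print $$1}') && exec /start.sh"]
    environment:
      - SERVICES=matching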

Once this is fixed, you will see in the FE/History logs that they successfully connect to both matching hosts:

{"level":"info","ts":"2021-02-18T22:47:36.799Z","msg":"Current reachable members","component":"service-resolver","service":"cadence-matching","addresses":"[HostA_IP:7935, HostB_IP:7935]","logging-call-at":"rpServiceResolver.go:246"},

See the Cadence Helm chart for an example of how we do this for K8s: https://github.com/banzaicloud/banzai-charts/blob/87cf2946434c22cb963fea47b662ea85974ecfc0/cadence/templates/server-configmap.yaml#L82
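For comparison, the usual Kubernetes pattern that the linked chart builds on injects the pod IP through the downward API; a simplified sketch (not the chart's literal contents):

env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  - name: BIND_ON_IP
    value: "$(POD_IP)"   # each pod binds on its own routable IP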
