Running the vitess operator and orchestrator on a local k8s cluster hit quite a few snags; here are some quick notes.
Environment: Ubuntu 12.04 LTS, set up following the referenced guide.
$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
Per the docs, the pod network CIDR must be specified, otherwise pod networking breaks. 10.244.0.0/16 is the range flannel's default manifest assumes.
Following the official AWS example, download operator.yaml and exampledb_aws.yaml locally, then apply them:
$ kubectl create -f ./operator.yaml
# wait for the operator to finish starting up
$ kubectl create -f ./exampledb_aws.yaml
The AWS example uses S3 for backups; on a local cluster, switch it to a local volume, e.g.:
volume:
  hostPath:
    path: /tmp/
This maps the host's /tmp directory to the vttablet backup directory.
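In context, the backup section of the example manifest ends up looking roughly like this (field names follow the vitess operator's VitessBackupLocation API; verify the exact structure against the operator version you downloaded):

```yaml
spec:
  backup:
    engine: builtin          # assumed; the AWS example may use a different engine
    locations:
      - volume:
          hostPath:
            path: /tmp/
            type: Directory
```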
The init_db.sql defined inside exampledb_aws.yaml also needs changes: the orchestrator user has to be created explicitly during vttablet initialization, e.g.:
CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'orchestrator';
GRANT SUPER, PROCESS, REPLICATION SLAVE, RELOAD
ON *.* TO 'orchestrator'@'%';
GRANT SELECT
ON _vt.* TO 'orchestrator'@'%';
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vitess-orchestrator
  labels:
    app: orchestrator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orchestrator
  template:
    metadata:
      labels:
        app: orchestrator
    spec:
      containers:
        - name: orc
          image: vitess/orchestrator
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: config-volume
              mountPath: /conf/
      volumes:
        - name: config-volume
          configMap:
            name: orchestrator-config
---
apiVersion: v1
kind: Service
metadata:
  name: orchestrator
spec:
  selector:
    app: orchestrator
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: orchestrator-config
data:
  orchestrator.conf.json: |
    {
      "ActiveNodeExpireSeconds": 5,
      "ApplyMySQLPromotionAfterMasterFailover": true,
      "AuditLogFile": "/tmp/orchestrator-audit.log",
      "AuditToSyslog": false,
      "AuthenticationMethod": "",
      "AuthUserHeader": "",
      "AutoPseudoGTID": false,
      "BackendDB": "sqlite",
      "BinlogEventsChunkSize": 10000,
      "CandidateInstanceExpireMinutes": 60,
      "CoMasterRecoveryMustPromoteOtherCoMaster": false,
      "DataCenterPattern": "[.]([^.]+)[.][^.]+[.]vitess[.]io",
      "Debug": true,
      "DefaultInstancePort": 3306,
      "DetachLostSlavesAfterMasterFailover": true,
      "DetectClusterAliasQuery": "SELECT value FROM _vt.local_metadata WHERE name='ClusterAlias'",
      "DetectClusterDomainQuery": "",
      "DetectInstanceAliasQuery": "SELECT value FROM _vt.local_metadata WHERE name='Alias'",
      "DetectPromotionRuleQuery": "SELECT value FROM _vt.local_metadata WHERE name='PromotionRule'",
      "DetectDataCenterQuery": "SELECT value FROM _vt.local_metadata WHERE name='DataCenter'",
      "DetectPseudoGTIDQuery": "",
      "DetectSemiSyncEnforcedQuery": "SELECT @@global.rpl_semi_sync_master_wait_no_slave AND @@global.rpl_semi_sync_master_timeout > 1000000",
      "DiscoverByShowSlaveHosts": false,
      "EnableSyslog": false,
      "ExpiryHostnameResolvesMinutes": 60,
      "DelayMasterPromotionIfSQLThreadNotUpToDate": true,
      "FailureDetectionPeriodBlockMinutes": 10,
      "GraphiteAddr": "",
      "GraphiteConvertHostnameDotsToUnderscores": true,
      "GraphitePath": "",
      "HostnameResolveMethod": "none",
      "HTTPAuthPassword": "",
      "HTTPAuthUser": "",
      "InstanceBulkOperationsWaitTimeoutSeconds": 10,
      "InstancePollSeconds": 5,
      "ListenAddress": ":3000",
      "MasterFailoverLostInstancesDowntimeMinutes": 0,
      "MySQLConnectTimeoutSeconds": 1,
      "MySQLHostnameResolveMethod": "none",
      "MySQLTopologyCredentialsConfigFile": "",
      "MySQLTopologyMaxPoolConnections": 3,
      "MySQLTopologyPassword": "orchestrator",
      "MySQLTopologyReadTimeoutSeconds": 3,
      "MySQLTopologySSLCAFile": "",
      "MySQLTopologySSLCertFile": "",
      "MySQLTopologySSLPrivateKeyFile": "",
      "MySQLTopologySSLSkipVerify": true,
      "MySQLTopologyUseMutualTLS": false,
      "MySQLTopologyUser": "orchestrator",
      "OnFailureDetectionProcesses": [
        "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}' >> /tmp/recovery.log"
      ],
      "OSCIgnoreHostnameFilters": [
      ],
      "PhysicalEnvironmentPattern": "[.]([^.]+[.][^.]+)[.]vitess[.]io",
      "PostFailoverProcesses": [
        "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
      ],
      "PostIntermediateMasterFailoverProcesses": [
        "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /tmp/recovery.log"
      ],
      "PostMasterFailoverProcesses": [
        "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log",
        "n=0; until [ $n -ge 10 ]; do vtctlclient -server example-vtctld-625ee430:15999 TabletExternallyReparented {successorAlias} && break; n=$[$n+1]; sleep 5; done"
      ],
      "PostponeSlaveRecoveryOnLagMinutes": 0,
      "PostUnsuccessfulFailoverProcesses": [
      ],
      "PowerAuthUsers": [
        "*"
      ],
      "PreFailoverProcesses": [
        "echo 'Will recover from {failureType} on {failureCluster}' >> /tmp/recovery.log"
      ],
      "ProblemIgnoreHostnameFilters": [
      ],
      "PromotionIgnoreHostnameFilters": [
      ],
      "PseudoGTIDMonotonicHint": "asc:",
      "PseudoGTIDPattern": "drop view if exists .*?`_pseudo_gtid_hint__",
      "ReadLongRunningQueries": false,
      "ReadOnly": false,
      "ReasonableMaintenanceReplicationLagSeconds": 20,
      "ReasonableReplicationLagSeconds": 10,
      "RecoverMasterClusterFilters": [
        ".*"
      ],
      "RecoveryIgnoreHostnameFilters": [
      ],
      "RecoveryPeriodBlockSeconds": 60,
      "ReduceReplicationAnalysisCount": true,
      "RejectHostnameResolvePattern": "",
      "RemoveTextFromHostnameDisplay": ".vitess.io:3306",
      "ReplicationLagQuery": "",
      "ServeAgentsHttp": false,
      "SkipBinlogEventsContaining": [
      ],
      "SkipBinlogServerUnresolveCheck": true,
      "SkipMaxScaleCheck": true,
      "SkipOrchestratorDatabaseUpdate": false,
      "SlaveStartPostWaitMilliseconds": 1000,
      "SnapshotTopologiesIntervalHours": 0,
      "SQLite3DataFile": ":memory:",
      "SSLCAFile": "",
      "SSLCertFile": "",
      "SSLPrivateKeyFile": "",
      "SSLSkipVerify": false,
      "SSLValidOUs": [
      ],
      "StaleSeedFailMinutes": 60,
      "StatusEndpoint": "/api/status",
      "StatusOUVerify": false,
      "UnseenAgentForgetHours": 6,
      "UnseenInstanceForgetHours": 240,
      "UseMutualTLS": false,
      "UseSSL": false,
      "VerifyReplicationFilters": false
    }
The config file comes from the vitess repo. Note that its username and password must match a user explicitly created in the init_db.sql used by the VitessCluster — orchestrator:orchestrator here. The vtctld server address also has to be specified, so it's best to create a Service with a fixed, recognizable name for it.
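A fixed Service for vtctld could be sketched like this (the selector label is an assumption about how the operator labels its vtctld pods — check with `kubectl get pods --show-labels` and adjust; 15999 is vtctld's standard grpc port, the one vtctlclient talks to):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vtctld-fixed
spec:
  selector:
    planetscale.com/component: vtctld   # assumed label; verify on your pods
  ports:
    - protocol: TCP
      port: 15999
      targetPort: 15999
```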
After the cluster came up, the node stayed NotReady and no pods could be scheduled. `describe node` showed the condition network plugin is not ready: cni config uninitialized. Installing flannel fixed it:
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
⚠️ fetching this may require a proxy
When the vitess components run, the operator creates PVCs, which need a provisioner to be satisfied; I went with rancher's local-path-provisioner:
$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
rancher registers its storage class as local-path, while the PVCs the vitess operator creates don't specify a StorageClass, so local-path has to be marked as the default — otherwise the PVCs stay in Pending forever. Patch it per the docs:
$ kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Check the StorageClass again:
$ kubectl get StorageClass
The output should now show (default):
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer false 81m
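With local-path as the default class, a PVC that omits storageClassName — which is what the operator creates — gets served automatically. A minimal PVC to sanity-check this (name and size are arbitrary):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: default-class-test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  # no storageClassName: the default (local-path) applies
```

Note that local-path uses WaitForFirstConsumer binding (see the table above), so this test PVC stays Pending until some pod actually mounts it — that part is expected.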
On a freshly installed single-node cluster, the only node is a master and won't schedule any pods; the taint has to be removed (see the referenced doc):
$ kubectl taint nodes --all node-role.kubernetes.io/master-
A quick round of Googling found this widely reported as a k8s problem; proactively downgrading microk8s made it go away:
$ sudo snap remove microk8s
# in versions after 1.13, docker was replaced with ctr
$ sudo snap install microk8s --classic --channel=1.13/stable
One very strange issue: changes to init_db.sql never took effect no matter what. Searching the vitess slack channel history turned up several people with the same symptom, and a reply from sougou suggested an apparmor problem. Removing the snaps entirely and installing the cluster with kubeadm instead solved it completely.
Domestic mirrors mostly work well, but sometimes, for reasons that cannot be described, docker images won't pull. Per the official docs, configure dockerd to use the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables.
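Following the Docker docs, those variables go into a systemd drop-in for the docker service (the proxy address below is a placeholder):

```ini
# /etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://proxy.example.com:8080"
Environment="HTTPS_PROXY=http://proxy.example.com:8080"
Environment="NO_PROXY=localhost,127.0.0.1"
```

Then reload and restart: `sudo systemctl daemon-reload && sudo systemctl restart docker`.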