An operator for synthetic monitoring on Kubernetes. Write your own tests in your own container and Kuberhealthy will manage everything else. Automatically creates and sends metrics to Prometheus and InfluxDB. Includes a simple JSON status page. Supplements other solutions like Prometheus very nicely!
Kuberhealthy is an operator for running synthetic checks. By creating a custom resource (a `khcheck`) in your cluster, you can easily enable various synthetic test containers. Kuberhealthy does all the work of scheduling your checks on an interval you specify (like a CronJob), ensuring they run properly within an allotted timeout, maintaining the current up/down state with durability, and producing metrics. There are lots of useful checks already available to ensure the core functionality of Kubernetes, but checks can be used to test anything you like. We encourage you to write your own check container in any language to test your own applications!
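For illustration, a `khcheck` resource might look roughly like the sketch below. The check name and image are placeholders, and the exact `apiVersion` and field names should be verified against the CRD shipped with your Kuberhealthy version:

```yaml
# A hypothetical khcheck: runs a custom checker image every 5 minutes.
apiVersion: comcast.github.io/v1        # verify against your installed CRD
kind: KuberhealthyCheck
metadata:
  name: my-app-check                    # placeholder name
  namespace: kuberhealthy
spec:
  runInterval: 5m                       # how often to run, CronJob-style
  timeout: 10m                          # the check fails if it exceeds this
  podSpec:                              # an ordinary pod spec for the checker
    containers:
      - name: main
        image: example.com/my-check:v1  # your own check container
```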
Kuberhealthy serves a simple JSON status page, a Prometheus metrics endpoint (at `/metrics`), and supports InfluxDB metric forwarding for integration into your choice of alerting solution.
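As a sketch, if you scrape the endpoint with a plain static Prometheus config (rather than service discovery), something like the following could work; the job name and interval are illustrative, and the target assumes the default `kuberhealthy` namespace:

```yaml
# Hypothetical static scrape config for Kuberhealthy's /metrics endpoint.
scrape_configs:
  - job_name: kuberhealthy              # illustrative job name
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ["kuberhealthy.kuberhealthy.svc.cluster.local:80"]
```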
Here is an illustration of how Kuberhealthy provisions and operates checker pods. In this example, the checker pod deploys a daemonset and then tears it down, carefully watching for errors. The result of the check is then sent back to Kuberhealthy and channeled into upstream metrics and status pages to indicate basic Kubernetes cluster functionality across all nodes in a cluster.
With Kuberhealthy, you can easily create synthetic tests to check your applications against real-world use cases. Read more about how checks are configured in the documentation here and learn how to create your own check container in any language here. Clients for checks outside of Go can be found in the clients directory.
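As a minimal sketch of a Go check using the project's checker client (import path assumed from the v2 module layout; verify it against the repository), a check runs its test and then reports the result back to Kuberhealthy:

```go
package main

import (
	"log"

	// Checker client shipped with Kuberhealthy; path assumed from the v2 module.
	"github.com/kuberhealthy/kuberhealthy/v2/pkg/checks/external/checkclient"
)

func main() {
	// Run whatever synthetic test exercises your application.
	// checkMyApplication is a placeholder for your real test logic.
	if err := checkMyApplication(); err != nil {
		// Report failure with human-readable error descriptions. The client
		// reads the reporting URL from env vars injected into checker pods.
		if reportErr := checkclient.ReportFailure([]string{err.Error()}); reportErr != nil {
			log.Fatalln("error reporting failure to Kuberhealthy:", reportErr)
		}
		return
	}
	if reportErr := checkclient.ReportSuccess(); reportErr != nil {
		log.Fatalln("error reporting success to Kuberhealthy:", reportErr)
	}
}

func checkMyApplication() error {
	return nil // placeholder: your real test goes here
}
```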
Requires Kubernetes 1.16 or above and Helm 3:
```sh
kubectl create namespace kuberhealthy
kubectl config set-context --current --namespace=kuberhealthy
helm repo add kuberhealthy https://kuberhealthy.github.io/kuberhealthy/helm-repos
helm install kuberhealthy kuberhealthy/kuberhealthy
```
After installation, Kuberhealthy will only be available from within the cluster (`Type: ClusterIP`) at the service URL `kuberhealthy.kuberhealthy`. To expose Kuberhealthy to an external checking agent, you must edit the service `kuberhealthy` and set `Type: LoadBalancer`. This is done for security. Options are available in the Helm chart to bypass this and deploy with `Type: LoadBalancer` directly.
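For example, you could patch the existing service, or set the type at install time, assuming the chart exposes a `service.type` value (check the chart's values.yaml):

```sh
# Patch the running service to type LoadBalancer.
kubectl patch service kuberhealthy -n kuberhealthy \
  -p '{"spec": {"type": "LoadBalancer"}}'

# Or at install time, assuming a service.type chart value exists:
helm install kuberhealthy kuberhealthy/kuberhealthy --set service.type=LoadBalancer
```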
Kuberhealthy is currently tested on Kubernetes `1.22.x`.
To configure Kuberhealthy after installation, see the configuration documentation.
Details on using the Helm chart are documented here. The Helm installation of Kuberhealthy is automatically updated to use the latest Kuberhealthy release.
More installation options, including static YAML files, are available in the /deploy directory. These flat spec files contain the most recent changes to Kuberhealthy from the master branch. Use these if you would like to test master branch updates.
Instead of trying to identify all the things that could potentially go wrong in your application or cluster with never-ending metrics and alert configurations, synthetic tests replicate real workflows and carefully verify that the expected behavior occurs. By default, Kuberhealthy monitors all basic Kubernetes cluster functionality, including deployments, daemonsets, services, nodes, kube-system health, and more.
Some examples of problems Kuberhealthy has detected in production with just the default checks enabled:

- Pods stuck in `Terminating` due to CNI communication failures
- Pods stuck in `ContainerCreating` due to disk provisioning errors
- Pods stuck in `Pending` due to container runtime errors
- A pod in the `kube-system` namespace that has begun restarting too quickly

You can directly access the current test statuses by querying the `kuberhealthy.kuberhealthy` HTTP service on port 80. The status page displays server status in the format shown below. The boolean `OK` field can be used to indicate global up/down status, while the `Errors` array will contain a list of all check error descriptions. Granular, per-check information, including how long the check took to run (`RunDuration`), the last time a check was run (`LastRun`), and which Kuberhealthy pod ran that specific check (`AuthoritativePod`), is available under the `CheckDetails` object.
```json
{
  "OK": true,
  "Errors": [],
  "CheckDetails": {
    "kuberhealthy/daemonset": {
      "OK": true,
      "Errors": [],
      "RunDuration": "22.512278967s",
      "Namespace": "kuberhealthy",
      "LastRun": "2019-11-14T23:24:16.7718171Z",
      "AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
      "uuid": "9abd3ec0-b82f-44f0-b8a7-fa6709f759cd"
    },
    "kuberhealthy/deployment": {
      "OK": true,
      "Errors": [],
      "RunDuration": "29.142295647s",
      "Namespace": "kuberhealthy",
      "LastRun": "2019-11-14T23:26:40.7444659Z",
      "AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
      "uuid": "5f0d2765-60c9-47e8-b2c9-8bc6e61727b2"
    },
    "kuberhealthy/dns-status-internal": {
      "OK": true,
      "Errors": [],
      "RunDuration": "2.43940936s",
      "Namespace": "kuberhealthy",
      "LastRun": "2019-11-14T23:34:04.8927434Z",
      "AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
      "uuid": "c85f95cb-87e2-4ff5-b513-e02b3d25973a"
    },
    "kuberhealthy/pod-restarts": {
      "OK": true,
      "Errors": [],
      "RunDuration": "2.979083775s",
      "Namespace": "kuberhealthy",
      "LastRun": "2019-11-14T23:34:06.1938491Z",
      "AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
      "uuid": "a718b969-421c-47a8-a379-106d234ad9d8"
    }
  },
  "CurrentMaster": "kuberhealthy-7cf79bdc86-m78qr"
}
```
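To view this page without exposing the service externally, you could port-forward it locally; the commands below assume the default namespace and service name:

```sh
# Forward local port 8080 to the Kuberhealthy service.
kubectl -n kuberhealthy port-forward service/kuberhealthy 8080:80 &

# Fetch the JSON status page.
curl http://localhost:8080/
```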
Kuberhealthy scales horizontally in order to be fault tolerant. By default, two instances are used with a pod disruption budget and a `RollingUpdate` strategy to ensure high availability.
The state of checks is centralized as custom resource records. This allows Kuberhealthy to always serve the same result, no matter which instance in the pool you hit. Each instance in the deployment calculates the current master running checks by simply querying the Kubernetes API for Ready Kuberhealthy pods with the correct label and sorting them alphabetically by name; the pod that comes first is the master. These two strategies together enable Kuberhealthy to maintain state and scale horizontally without deploying an additional backing database.
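A minimal sketch of that election logic, assuming client-go and illustrative label and namespace values (this is not the project's actual code):

```go
package main

import (
	"context"
	"sort"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// currentMaster returns the name of the Kuberhealthy pod that should run
// checks: the alphabetically-first Ready pod matching the label selector.
func currentMaster(ctx context.Context, client kubernetes.Interface) (string, error) {
	pods, err := client.CoreV1().Pods("kuberhealthy").List(ctx, metav1.ListOptions{
		LabelSelector: "app=kuberhealthy", // illustrative label
	})
	if err != nil {
		return "", err
	}

	// Collect only pods whose Ready condition is true.
	var ready []string
	for _, pod := range pods.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready = append(ready, pod.Name)
				break
			}
		}
	}
	if len(ready) == 0 {
		return "", nil // no ready pods; no master
	}
	sort.Strings(ready)
	return ready[0], nil // alphabetically-first ready pod is master
}
```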
Using Kuberhealthy with Prometheus can help capture useful synthetic KPIs. Check out the K8s KPIs with Kuberhealthy doc to learn more on how to install Kuberhealthy and collect cluster KPIs.
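For example, recent releases expose a per-check gauge (commonly `kuberhealthy_check`, reporting 1 for passing and 0 for failing; verify the name against your `/metrics` output), which supports simple alerting queries such as:

```promql
# Fires for every check currently reporting a failure
# (assumes the kuberhealthy_check gauge).
kuberhealthy_check == 0
```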
By default, Kuberhealthy exposes an insecure (non-HTTPS) JSON status endpoint without authentication. You should never expose this endpoint to the public internet: when errors occur, the status page can reveal private cluster information.
Vulnerabilities or other security-related issues should be logged as GitHub issues in this project. All new issues are reviewed regularly. Please be careful not to post any sensitive information in your report!
If you're interested in contributing to this project, check out open issues with the `good first issue` tag.

Running a Kubernetes cluster in production is a demanding job with many moving parts, and keeping a close watch on all of those different pieces is not easy. Worse, Kubernetes is highly distributed and frequently self-healing: if something goes wrong in the cluster, the breakage may be intermittent (or specific enough) that it goes unseen for a long time. During that time, of course, the experience of your customers or developers may degrade or break down entirely. Some of the sneaky problems that can go unnoticed for long periods…
In November 2019 at KubeCon San Diego, we released Kuberhealthy 2.0.0, turning Kuberhealthy into a Kubernetes operator for synthetic monitoring. This new functionality gave developers a way to create their own Kuberhealthy check containers to synthetically monitor their applications and clusters. The community quickly adopted this new feature, and our thanks go to everyone who implemented and tested Kuberhealthy 2.0.0 in their own clusters. 1. Deploy Kuberhealthy…