Overview
What is Kubernetes?
- Core layer: the most essential functionality of Kubernetes. It exposes APIs outward for building higher-level applications, and inward provides a pluggable application execution environment.
- Application layer: deployment (stateless applications, stateful applications, batch jobs, clustered applications, and so on) and routing (service discovery, DNS resolution, and so on).
- Management layer: system metrics (such as infrastructure, container, and network metrics), automation (such as auto-scaling and dynamic provisioning), and policy management (RBAC, Quota, PSP, NetworkPolicy, and so on).
- Interface layer: the kubectl command-line tool, client SDKs, and cluster federation.
- Ecosystem: the large ecosystem of container cluster management and scheduling above the interface layer, which falls into two categories:
  - Outside Kubernetes: logging, monitoring, configuration management, CI, CD, workflow, FaaS, OTS applications, ChatOps, and so on.
  - Inside Kubernetes: CRI, CNI, CVI, image registries, Cloud Provider, and configuration and management of the cluster itself.
Kubernetes is an open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure.
Features:
- Deploy your applications quickly and predictably.
- Scale your applications on the fly.
- Seamlessly roll out new features.
- Optimize use of your hardware by using only the resources you need.
Other characteristics:
- portable: public, private, hybrid, multi-cloud
- extensible: modular, pluggable, hookable, composable
- self-healing: auto-placement, auto-restart, auto-replication, auto-scaling
Why containers?
- Agile application creation and deployment: Increased ease and efficiency of container image creation compared to VM image use.
- Continuous development, integration, and deployment: Provides for reliable and frequent container image build and deployment with quick and easy rollbacks (due to image immutability).
- Dev and Ops separation of concerns: Create application container images at build/release time rather than deployment time, thereby decoupling applications from infrastructure.
- Environmental consistency across development, testing, and production: Runs the same on a laptop as it does in the cloud.
- Cloud and OS distribution portability: Runs on Ubuntu, RHEL, CoreOS, on-prem, Google Container Engine, and anywhere else.
- Application-centric management: Raises the level of abstraction from running an OS on virtual hardware to run an application on an OS using logical resources.
- Loosely coupled, distributed, elastic, liberated micro-services: Applications are broken into smaller, independent pieces and can be deployed and managed dynamically – not a fat monolithic stack running on one big single-purpose machine.
- Resource isolation: Predictable application performance.
- Resource utilization: High efficiency and density.
What Kubernetes provides
- co-locating helper processes, facilitating composite applications and preserving the one-application-per-container model,
- mounting storage systems,
- distributing secrets,
- application health checking,
- replicating application instances,
- horizontal auto-scaling,
- naming and discovery,
- load balancing,
- rolling updates,
- resource monitoring,
- log access and ingestion,
- support for introspection and debugging, and
- identity and authorization.
Summary: Kubernetes schedules, manages, and scales applications (Deployment / DaemonSet / StatefulSet / Job, health checks, auto-scaling, rolling updates), provides a runtime platform for applications (logging, monitoring, service discovery, load balancing, authentication and authorization), and manages, controls, and allocates platform resources (memory, CPU, network, storage, images).
Consider the definition of an operating system:
An operating system (OS) is the collection of programs that controls and manages the hardware and software resources of an entire computer system, organizes and schedules the computer's work and resource allocation, and provides a convenient interface and environment to users and other software. Kubernetes is a distributed operating system: it manages the software and hardware resources of a cluster of machines, organizes the scheduling of programs (containers) and the allocation of resources, and provides a convenient interface and environment to users and other software.
Most concepts from single-machine operating systems already have, or are getting, a counterpart in Kubernetes. For example, systemctl has a reload operation; Kubernetes does not have one yet, but it is something Kubernetes is working toward.
What Kubernetes is not
This section is interesting and well worth reading. Many of the things Kubernetes is not are exactly what Kubernetes distributors need to consider and deliver.
- Does not limit the types of applications supported. It does not dictate application frameworks (e.g., Wildfly), restrict the set of supported language runtimes (for example, Java, Python, Ruby), cater to only 12-factor applications, nor distinguish apps from services. Kubernetes aims to support an extremely diverse variety of workloads, including stateless, stateful, and data-processing workloads. If an application can run in a container, it should run great on Kubernetes.
- Does not provide middleware (e.g., message buses), data-processing frameworks (for example, Spark), databases (e.g., MySQL), nor cluster storage systems (e.g., Ceph) as built-in services. Such applications run on Kubernetes.
- Does not have a click-to-deploy service marketplace.
- Does not deploy source code and does not build your application. Continuous Integration (CI) workflow is an area where different users and projects have their own requirements and preferences, so it supports layering CI workflows on Kubernetes but doesn't dictate how layering should work.
- Allows users to choose their logging, monitoring, and alerting systems. (It provides some integrations as proof of concept.)
- Does not provide nor mandate a comprehensive application configuration language/system (for example, jsonnet).
- Does not provide nor adopt any comprehensive machine configuration, maintenance, management, or self-healing systems.
Kubernetes Components
Role | Component | Description
---|---|---
Master Components | kube-apiserver | Exposes the Kubernetes API; it is the front end for the Kubernetes control plane.
Master Components | etcd | Kubernetes' backing store; all cluster data is stored here.
Master Components | kube-controller-manager | A single binary that includes:
- | - | 1. Node Controller: notices and responds when nodes go down.
- | - | 2. Replication Controller: maintains the correct number of pods for every ReplicationController object.
- | - | 3. Endpoints Controller: populates the Endpoints object (i.e. joins Services and Pods).
- | - | 4. Service Account & Token Controllers: create default accounts and API access tokens for namespaces.
- | - | 5. Others.
Master Components | cloud-controller-manager | A binary running the controllers that interact with cloud providers, including:
- | - | 1. Node Controller: checks the cloud provider to determine whether a node has been deleted in the cloud after it stops responding.
- | - | 2. Route Controller: sets up routes in the underlying cloud infrastructure.
- | - | 3. Service Controller: creates, updates, and deletes cloud provider load balancers.
- | - | 4. Volume Controller: creates, attaches, and mounts volumes, interacting with the cloud provider to orchestrate them.
Master Components | kube-scheduler | Watches newly created pods that have no node assigned and selects a node for them to run on.
Master Components | addons | Addons are pods and services that implement cluster features, e.g.:
- | - | DNS (cluster DNS is a DNS server, in addition to the other DNS server(s) in your environment, which serves DNS records for Kubernetes services),
- | - | user interface, container resource monitoring, cluster-level logging.
Node Components | kubelet | The primary node agent. Its main functions:
- | - | 1. Watches for pods that have been assigned to its node (either by the apiserver or via a local configuration file).
- | - | 2. Mounts the pod's required volumes.
- | - | 3. Downloads the pod's secrets.
- | - | 4. Runs the pod's containers via docker (or, experimentally, rkt).
- | - | 5. Periodically executes any requested container liveness probes.
- | - | 6. Reports the status of the pod back to the rest of the system, creating a "mirror pod" if necessary.
- | - | 7. Reports the status of the node back to the rest of the system.
Node Components | kube-proxy | Enables the Kubernetes service abstraction by maintaining network rules on the host and performing connection forwarding.
Node Components | docker/rkt | For actually running containers.
Node Components | supervisord | A lightweight process babysitting system that keeps kubelet and docker running.
Node Components | fluentd | A daemon which helps provide cluster-level logging.
Kubernetes Objects
Understanding Kubernetes Objects
Classification
Category | Names |
---|---|
Resource objects | Pod, ReplicaSet, ReplicationController, Deployment, StatefulSet, DaemonSet, Job, CronJob, HorizontalPodAutoscaler |
Configuration objects | Node, Namespace, Service, Secret, ConfigMap, Ingress, Label, ThirdPartyResource, ServiceAccount |
Storage objects | Volume, PersistentVolume |
Policy objects | SecurityContext, ResourceQuota, LimitRange |
Kubernetes Objects are persistent entities in the Kubernetes system. Kubernetes uses these entities to represent the state of your cluster. Specifically, they can describe:
- What containerized applications are running (and on which nodes): applications
- The resources available to those applications: resources
- The policies around how those applications behave, such as restart policies, upgrades, and fault-tolerance: policies
Kubernetes objects describe desired state
=> the system is state-driven
Kubernetes objects are, in essence, applications, resources, and policies
Object Spec and Status
Every object has two nested fields: the Object Spec, which describes the desired state, and the Object Status, which describes the current state. The system works to make the Object Status match the Object Spec.
The job of the Kubernetes Control Plane is to drive each object's actual state toward its desired state.
Name / Namespace
Omitted.
Labels and Selectors
Labels are key/value pairs that are attached to objects, such as pods. Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users, but which do not directly imply semantics to the core system
.
Labels are not unique; many objects can carry the same label(s).
Via a label selector, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.
The API currently supports two types of selectors: equality-based (e.g. environment = production) and set-based (e.g. environment in (production, qa)).
API
For examples, see kubernetes.io/docs/concep…
Labels can be used for LIST and WATCH filtering, and for set references in API objects.
Examples of set references in API objects:
Some Kubernetes objects, such as services and replicationcontrollers, also use label selectors to specify sets of other resources, such as pods. These, however, support only equality-based requirement selectors:
"selector": {
    "component": "redis"
}
Newer resources, such as Job, Deployment, ReplicaSet, and DaemonSet, support set-based requirements as well:
selector:
matchLabels:
component: redis
matchExpressions:
- {key: tier, operator: In, values: [cache]}
  - {key: environment, operator: NotIn, values: [dev]}
Another use case is using labels to select nodes.
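As an illustrative sketch (the pod name, image, and label key/value are hypothetical), a pod can be constrained to nodes carrying a given label via nodeSelector:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ssd-demo            # hypothetical name
spec:
  containers:
  - name: app
    image: nginx            # placeholder image
  nodeSelector:
    disktype: ssd           # only nodes labeled disktype=ssd are eligible
```

The node has to be labeled first, for example with `kubectl label nodes <node-name> disktype=ssd`.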
Annotations
Annotations attach metadata to objects.
How annotations differ from labels:
You can use either labels or annotations to attach metadata to Kubernetes objects. Labels can be used to select objects and to find collections of objects that satisfy certain conditions. In contrast, annotations are not used to identify and select objects. The metadata in an annotation can be small or large, structured or unstructured, and can include characters not permitted by labels.
The Kubernetes API
Complete API details are documented using Swagger v1.2 and OpenAPI (that is, Swagger 2.0).
API versioning
e.g. /api/v1. By stability, API versions are classified as stable (v1), alpha (v1alpha1), and beta (v2beta3).
API groups
API groups make it easier to extend the Kubernetes API.
Currently there are several API groups in use:
- The core group (often called "legacy" because it has no explicit group name) is at REST path /api/v1 and is not specified as part of the apiVersion field, e.g. apiVersion: v1.
- The named groups are at REST path /apis/$GROUP_NAME/$VERSION and use apiVersion: $GROUP_NAME/$VERSION (e.g. apiVersion: batch/v1, or /apis/apps/v1beta2/).
There are currently two ways to extend the API: CustomResourceDefinition and kube-aggregator.
An API group can be enabled or disabled when the apiserver starts, for example:
--runtime-config=extensions/v1beta1/deployments=false,extensions/v1beta1/ingress=false
API Conventions
This section comes from github.com/kubernetes/…
Kinds fall into three categories:
- Objects represent a persistent entity in the system. Examples: Pod, ReplicationController, Service, Namespace, Node.
- Lists are collections of resources of one (usually) or more (occasionally) kinds. Examples: PodList, ServiceList, NodeList.
- Simple: used for specific actions on objects and for non-persistent entities. Many simple resources are "subresources" such as /binding, /status, and /scale, i.e. a small part of a resource.
Resources
All JSON objects returned by an API MUST have the following fields:
- kind: a string that identifies the schema this object should have
- apiVersion: a string that identifies the version of the schema the object should have
Objects
Object field | Notes |
---|---|
Metadata | MUST: namespace, name, uid; SHOULD: resourceVersion, generation, creationTimestamp, deletionTimestamp, labels, annotations |
Spec and Status | Status (current state) converges toward Spec (desired state). A /status subresource MUST be provided to enable system components to update statuses of resources they manage. Status typically consists of Conditions. |
References to related objects | ObjectReference type |
Lists and Simple kinds
Differing Representations
Verbs on Resources
PATCH is special in that three patch types are supported:
- JSON Patch
- Merge Patch
- Strategic Merge Patch
Idempotency
All compatible Kubernetes APIs MUST support "name idempotency" and respond with an HTTP status code 409 "Conflict" when a POST names an object that already exists.
Optional vs. Required
Optional fields have the following properties:
- They have +optional struct tag in Go.
- They are a pointer type in the Go definition or have a built-in nil value
- The API server should allow POSTing and PUTing a resource with this field unset
Use the +optional struct tag rather than omitempty.
Defaulting
Late Initialization
Concurrency Control and Consistency
resourceVersion is used for concurrency control:
All Kubernetes resources have a "resourceVersion" field as part of their metadata.
Kubernetes leverages the concept of resource versions to achieve optimistic concurrency.
The resourceVersion is changed by the server every time an object is modified.
Serialization Format
Units
Selecting Fields
Object references
HTTP Status codes
Response Status Kind
Which API responses return the Status kind?
Kubernetes will always return the Status kind from any API endpoint when an error occurs. Clients SHOULD handle these types of objects when appropriate.
A Status kind will be returned by the API in two cases:
- When an operation is not successful (i.e. when the server would return a non-2xx HTTP status code).
- When an HTTP DELETE call is successful.
$ curl -v -k -H "Authorization: Bearer WhCDvq4VPpYhrcfmF6ei7V9qlbqTubUc" https://10.240.122.184:443/api/v1/namespaces/default/pods/grafana
> GET /api/v1/namespaces/default/pods/grafana HTTP/1.1
> User-Agent: curl/7.26.0
> Host: 10.240.122.184
> Accept: */*
> Authorization: Bearer WhCDvq4VPpYhrcfmF6ei7V9qlbqTubUc
>
< HTTP/1.1 404 Not Found
< Content-Type: application/json
< Date: Wed, 20 May 2015 18:10:42 GMT
< Content-Length: 232
<
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "pods \"grafana\" not found",
"reason": "NotFound",
"details": {
"name": "grafana",
"kind": "pods"
},
"code": 404
}
Events
Naming conventions
Label, selector, and annotation conventions
WebSockets and SPDY
The API therefore exposes certain operations over upgradeable HTTP connections (described in RFC 2817) via the WebSocket and SPDY protocols.
Two protocols are supported:
- Streamed channels: Kubernetes supports a SPDY based framing protocol that leverages SPDY channels and a WebSocket framing protocol that multiplexes multiple channels onto the same stream by prefixing each binary chunk with a byte indicating its channel
- Streaming response: HTTP Chunked Transfer-Encoding
Validation
Kubernetes Architecture
Nodes
Node Status | Description |
---|---|
Addresses | HostName / ExternalIP / InternalIP |
Condition | OutOfDisk / Ready / MemoryPressure / DiskPressure / NetworkUnavailable |
Capacity | The resources available on the node: CPU, memory, and the maximum number of pods that can be scheduled. |
Info | General node information, such as kernel version, Kubernetes version, docker version, and OS name. |
Management
Node Controller
The node controller is a Kubernetes master component which manages various aspects of nodes.
Responsibilities:
- assigning a CIDR block to the node when it is registered
- keeping the node controller’s internal list of nodes up to date with the cloud provider’s list of available machines
- monitoring the nodes’ health
- Starting in Kubernetes 1.6, the NodeController is also responsible for evicting pods that are running on nodes with NoExecute taints
- Starting in version 1.8, the node controller can be made responsible for creating taints that represent Node conditions.
Master-Node communication
Concepts Underlying the Cloud Controller Manager
The CCM consolidates all of the cloud-dependent logic from the preceding three components to create a single point of integration with the cloud. The new architecture with the CCM looks like this
TODO
Extending the Kubernetes API
Custom Resources
Custom resources
Custom controllers
CustomResourceDefinitions
API server aggregation
Extending the Kubernetes API with the aggregation layer
Containers
Images
Updating Images
The default pull policy is IfNotPresent, which causes the kubelet to skip pulling an image that already exists on the node.
To force a pull, set imagePullPolicy: Always.
The recommended practice is a fixed tag plus IfNotPresent, rather than latest plus Always, because with latest you cannot tell which version is actually running. In practice the pull is delegated to a runtime such as docker, and even with Always no large amount of data is re-downloaded when the layers already exist locally, so in that respect Always is harmless.
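A minimal sketch of the recommended pattern, a pinned tag with IfNotPresent (the image name and tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: myapp
    image: example.com/myapp:v1.2.3   # pin a specific tag rather than :latest
    imagePullPolicy: IfNotPresent     # pull only if the image is absent on the node
```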
Using a Private Registry
Available options:
Using Google Container Registry
Using AWS EC2 Container Registry
Using Azure Container Registry (ACR)
Configuring Nodes to Authenticate to a Private Repository
Via $HOME/.docker/config.json (what about credential expiry??)
Pre-pulling Images
Specifying ImagePullSecrets on a Pod
Creating a Secret with a Docker Config
$ kubectl create secret docker-registry myregistrykey --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL
secret "myregistrykey" created.
Bypassing kubectl create secrets
You can also create the secret from the contents of .docker/config.json via YAML, without going through kubectl.
Referring to an imagePullSecrets on a Pod
How to use the imagePullSecrets you created:
You can specify them in the pod spec, or have a service account apply them automatically.
You can use this in conjunction with a per-node .docker/config.json; the credentials will be merged. This approach works on Google Container Engine (GKE).
apiVersion: v1
kind: Pod
metadata:
name: foo
namespace: awesomeapps
spec:
containers:
- name: foo
image: janedoe/awesomeapp:v1
imagePullSecrets:
  - name: myregistrykey
Use Cases
Use cases. Note in particular the AlwaysPullImages admission controller: it sometimes needs to be enabled, for example in multi-tenant clusters, since otherwise one tenant's pods could use another tenant's private images.
Container Environment Variables
Container information
- Pod information and much other metadata can be exposed as environment variables via the downward API
- Secrets can also be exposed as environment variables
- Custom environment variables can be defined in the pod spec
For the various ways of mapping metadata into files or environment variables inside a container, see kubernetes.io/docs/tasks/… and the related docs.
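A sketch combining the three sources above in one pod spec (the pod name and the GREETING variable are illustrative; the fieldPath values are standard downward API fields):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: env-demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "env && sleep 3600"]
    env:
    - name: MY_POD_NAME                # downward API: pod metadata
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: MY_POD_IP                  # downward API: pod status
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: GREETING                   # plain user-defined variable
      value: "hello"
```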
Cluster information
The host/port of every service that exists when the container is created is injected into the container as environment variables (apparently scoped to the same namespace). This guarantees that services can be reached even when the DNS addon is not enabled, although this mechanism is not reliable.
Container Lifecycle Hooks
Container Hooks
Hook Details
There are currently two hooks, PostStart and PreStop. If a hook call hangs, the Pod's state transitions are blocked.
- PostStart: executes immediately after a container is created. It is not guaranteed to run before the ENTRYPOINT.
- PreStop: called immediately before a container is terminated. It runs synchronously, and the maximum time it may run is bounded by the grace period.
Hook Handler Implementations
Two handler types are supported: Exec and HTTP.
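An illustrative sketch showing one handler of each type on a single container (the command, endpoint path, and names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: web
    image: nginx
    lifecycle:
      postStart:
        exec:                          # Exec handler: runs a command inside the container
          command: ["/bin/sh", "-c", "echo started > /tmp/started"]
      preStop:
        httpGet:                       # HTTP handler: GET against the container
          path: /shutdown              # hypothetical endpoint
          port: 80
```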
Hook Handler Execution
- Hook handler calls are synchronous within the context of the Pod containing the Container. This means that for a PostStart hook, the Container ENTRYPOINT and hook fire asynchronously. However, if the hook takes too long to run or hangs, the Container cannot reach a running state.
- The behavior is similar for a PreStop hook. If the hook hangs during execution, the Pod phase stays Terminating and the Pod is killed after its terminationGracePeriodSeconds expires. If a PostStart or PreStop hook fails, it kills the Container.
Given the characteristics above, PostStart and PreStop are currently designed for very lightweight commands. For anything heavier, consider an init container, or a defer container (not yet implemented; there is an issue for it).
Hook delivery guarantees
Hooks are generally delivered only once, but delivery is not guaranteed.
Debugging Hook Handlers
If a handler fails for some reason, it broadcasts an event.
You can see these events by running kubectl describe pod
Workloads
Pods
Pod Overview
What a Pod is: the smallest unit of deployment. It encompasses one or more application containers, (shared) storage resources, a network IP, and options.
A Pod encapsulates an application container (or, in some cases, multiple containers), storage resources, a unique network IP, and options that govern how the container(s) should run. A Pod represents a unit of deployment: a single instance of an application in Kubernetes, which might consist of either a single container or a small number of containers that are tightly coupled and that share resources.
References:
blog.kubernetes.io/2015/06/the… (use cases for multi-container pods: Sidecar (git sync, log shipping, ...), Ambassador (proxies, transparent proxying), Adapter (exporters), ...)
blog.kubernetes.io/2016/06/con…
Understanding Pods
How Pods manage multiple Containers
For example, multiple containers in a pod share:
- Networking
- Storage
Working with Pods
Pods are designed as relatively ephemeral, disposable entities. Pods do not, by themselves, self-heal. Kubernetes uses a higher-level abstraction, called a Controller, that handles the work of managing the relatively disposable Pod instances.
Pods and Controllers
A Controller can create and manage multiple Pods for you, handling replication and rollout and providing self-healing capabilities at cluster scope. For example, if a Node fails, the Controller might automatically replace the Pod by scheduling an identical replacement on a different Node.
Some examples of Controllers that contain one or more pods include:
- Deployment
- StatefulSet
- DaemonSet
Pod Templates
Controllers use Pod Templates to create actual pods.
A pod template does not specify the desired state of all replicas; a pod, by contrast, does specify the desired state of all containers belonging to it.
Pod Lifecycle
Pod phase
A Pod’s status field is a PodStatus object, which has a phase field.
Phase | Description |
---|---|
Pending | The Pod has been accepted by the Kubernetes system, but one or more of the Container images has not been created. |
Running | The Pod has been bound to a node, and all of the Containers have been created. At least one Container is still running, or is in the process of starting or restarting. |
Succeeded | All Containers in the Pod have terminated in success, and will not be restarted. |
Failed | All Containers in the Pod have terminated, and at least one Container has terminated in failure. |
Unknown | The state of the Pod could not be obtained, typically due to an error communicating with the Pod's host. |
Pod termination
- The user sends a command to delete the Pod; the default grace period is 30 seconds.
- Once the Pod exceeds the grace period, the API server updates the Pod's state to "dead".
- The Pod is shown as "Terminating" on the client command line.
- At the same time as step 3, when the kubelet sees the Pod marked "Terminating", it starts shutting the Pod's processes down:
  - If a preStop hook is defined for the Pod, it is invoked before the Pod is stopped. If the preStop hook is still running when the grace period expires, step 2 is extended with an extra 2 seconds of grace.
  - The TERM signal is sent to the processes in the Pod.
- At the same time as step 3, the Pod is removed from the service's endpoint list and no longer counts as part of the replication controller's set. Pods that shut down slowly keep serving traffic the load balancer has already forwarded to them.
- After the grace period expires, any processes still running in the Pod are killed with SIGKILL.
- The kubelet completes the Pod's deletion on the API server by setting the grace period to 0 (immediate deletion). The Pod disappears from the API and is no longer visible from the client.
Pod conditions
A Pod has a PodStatus, which has an array of PodConditions. Each element of the PodCondition array has a type field and a status field.
status:
conditions:
- lastProbeTime: null
lastTransitionTime: 2017-10-28T06:30:03Z
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: 2017-10-28T06:30:13Z
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: 2017-10-28T06:30:03Z
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://dd82608cabe226247bcbc8d5fbce6121edf935320486c41046481000dbb7784f
image: deis/brigade-api:latest
imageID: docker-pullable://deis/brigade-api@sha256:943cf822adddf6869ff02d2e1a55cbb19c96d01be41e88d1d56bc16a50f5c91f
lastState: {}
name: brigade
ready: true
restartCount: 0
state:
running:
      startedAt: 2017-10-28T06:30:06Z
Container probes
A Probe is a diagnostic performed periodically by the kubelet on a Container. To perform a diagnostic, the kubelet calls a Handler implemented by the Container.
Three check mechanisms:
- ExecAction
- TCPSocketAction
- HTTPGetAction
Three results: Success, Failure, Unknown.
Two probe types: livenessProbe (tied to the restart policy) and readinessProbe.
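An illustrative sketch combining the two probe types (the health endpoint, ports, and timings are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: nginx
    livenessProbe:                  # failure triggers the restart policy
      httpGet:
        path: /healthz              # hypothetical health endpoint
        port: 80
      initialDelaySeconds: 15
      periodSeconds: 10
    readinessProbe:                 # failure removes the pod from service endpoints
      tcpSocket:
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
```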
When should you use liveness or readiness probes?
todo
Pod and Container status
Restart policy
Pod lifetime
- Use a Job for Pods that are expected to terminate, for example, batch computations. Jobs are appropriate only for Pods with restartPolicy equal to OnFailure or Never.
- Use a ReplicationController, ReplicaSet, or Deployment for Pods that are not expected to terminate, for example, web servers. ReplicationControllers are appropriate only for Pods with a restartPolicy of Always.
- Use a DaemonSet for Pods that need to run one per machine, because they provide a machine-specific system service.
Examples
Pods with a single container
The notable point here is that if a pod is designed to run to completion, its restartPolicy must not be Always.
Current pod phase | Container event | Pod restartPolicy | Action on container | Log | Resulting pod phase |
---|---|---|---|---|---|
Running | exits with success | Always | Restart Container | Log completion event | Running |
Running | exits with success | OnFailure | - | Log completion event | Succeeded |
Running | exits with success | Never | - | Log completion event | Succeeded |
Running | exits with failure | Always | Restart Container | Log failure event | Running |
Running | exits with failure | OnFailure | Restart Container | Log failure event | Running |
Running | exits with failure | Never | - | Log failure event | Failed |
Running | oom | Always | Restart Container | Log OOM event | Running |
Running | oom | OnFailure | Restart Container | Log OOM event | Running |
Running | oom | Never | - | Log OOM event | Failed |
Pods with two containers
Current pod phase | Container 1 event | Pod restartPolicy | Action on container | Log | Resulting pod phase |
---|---|---|---|---|---|
Running | exits with failure | Always | Restart Container | Log failure event | Running |
Running | exits with failure | OnFailure | Restart Container | Log failure event | Running |
Running | exits with failure | Never | - | Log failure event | Running; becomes Failed once container 2 also exits |
Init Containers
Init containers are commonly used for setup, or for waiting on some setup to complete.
Init Containers are exactly like regular Containers, except:
- They always run to completion.
- Each one must complete successfully before the next one is started.
Detailed behavior
- A Pod cannot be Ready until all Init Containers have succeeded.
- If the Pod is restarted, all Init Containers must execute again.
- Probes (readinessProbe and the like) cannot be used on Init Containers.
- Use activeDeadlineSeconds on the Pod and livenessProbe on the Container to prevent Init Containers from failing forever.
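A sketch of the wait-for-setup pattern: an init container blocks until a dependency's service name resolves before the app container starts (the service name mydb and the app image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: wait-for-db                 # must exit 0 before 'app' is started
    image: busybox
    command: ["sh", "-c", "until nslookup mydb; do echo waiting; sleep 2; done"]
  containers:
  - name: app
    image: example.com/myapp:v1       # placeholder image
```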
Pod Preset
A Pod Preset is a way to inject information into pods at creation time.
A PodPreset causes an admission controller to transparently modify the spec of a matching class of pods, dynamically injecting dependencies such as environment variables and volume mounts.
Behavior:
When a PodPreset is applied to one or more Pods, Kubernetes modifies the pod spec. For Env, EnvFrom, and VolumeMounts, Kubernetes modifies the spec of every container in the Pod; for Volumes, Kubernetes modifies the Pod spec itself.
Example:
kind: PodPreset
apiVersion: settings.k8s.io/v1alpha1
metadata:
name: allow-database
namespace: myns
spec:
selector:
matchLabels:
role: frontend
env:
- name: DB_PORT
value: "6379"
volumeMounts:
- mountPath: /cache
name: cache-volume
volumes:
- name: cache-volume
      emptyDir: {}
Reference: www.jianshu.com/p/83fe99a5e…
Pod security policies
Admission control with PodSecurityPolicy allows the creation and modification of cluster resources to be controlled, based on the capabilities permitted cluster-wide.
If any policy matches the request, the Pod is accepted; if the request matches no PSP, the Pod is rejected.
Disruptions
Voluntary and Involuntary Disruptions
Unavoidable cases are involuntary disruptions to an application, for example: hardware failure, kernel panic, a node disappearing, eviction of a pod because the node is out of resources, and so on.
Voluntary disruptions include, for example: deleting or updating the deployment/pod, and draining a node for repair, upgrade, or cluster scale-down.
Dealing with Disruptions
How to mitigate the impact of involuntary disruptions: request the resources you need, and replicate and spread:
- Ensure your pod requests the resources it needs.
- Replicate your application if you need higher availability
- For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster.)
How Disruption Budgets Work
In Kubernetes, to keep a service uninterrupted and its SLA from degrading, an application is deployed as a replicated set of pods. The PodDisruptionBudget controller lets you set the minimum number, or the minimum percentage, of an application's pods that must remain running, which guarantees that voluntary pod destruction never takes out too many pods at once.
Use tools that call the Eviction API, such as kubectl drain, rather than deleting pods directly, because the Eviction API respects Pod Disruption Budgets.
- PDBs cannot prevent involuntary disruptions from occurring, but they do count against the budget.
- Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but controllers (like Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades; the handling of failures during application updates is configured in the controller spec.
- When a pod is evicted using the eviction API, it is gracefully terminated
References:
www.kubernetes.org.cn/2486.html
ju.outofmemory.cn/entry/32756…
PDB Example
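A minimal PodDisruptionBudget sketch, assuming a hypothetical application whose pods are labeled app: zookeeper:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2        # evictions are refused if they would leave fewer than 2 pods running
  selector:
    matchLabels:
      app: zookeeper
```

With this budget in place, kubectl drain evicts the matching pods one at a time and waits until the budget is satisfied again before continuing.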
Separating Cluster Owner and Application Owner Roles
How to perform Disruptive Actions on your Cluster
Write disruption tolerant applications and use PDBs
Controllers
Replica Sets
ReplicaSets are generally not used directly; use Deployments instead.
They are mainly used by Deployments as a mechanism to orchestrate pod creation, deletion, and updates.
When to use a ReplicaSet
A ReplicaSet ensures that a specified number of pod replicas are running at any given time.
Working with ReplicaSets
Some operations:
- kubectl delete: kubectl scales the ReplicaSet to zero and waits for it to delete each pod before deleting the ReplicaSet itself.
- --cascade=false deletes only the ReplicaSet and leaves its pods running.
- By changing a pod's labels, you can isolate it from a ReplicaSet; a pod removed this way is replaced automatically.
- Scaling: set .spec.replicas.
- A ReplicaSet can also be the target of a Horizontal Pod Autoscaler (HPA) for automatic scaling.
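An HPA sketch targeting a Deployment (the target name and thresholds are illustrative); the same scaleTargetRef shape works for a ReplicaSet:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta2
    kind: Deployment
    name: frontend                       # hypothetical target
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80     # scale out when average CPU exceeds 80%
```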
Replication Controller
Omitted; Replication Controllers are no longer recommended.
Deployments
A Deployment controller provides declarative updates for Pods and ReplicaSets.
Use case
- Create a Deployment to rollout a ReplicaSet. The ReplicaSet creates Pods in the background. Check the status of the rollout to see if it succeeds or not.
- Declare the new state of the Pods by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created and the Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. Each new ReplicaSet updates the revision of the Deployment.
- Rollback to an earlier Deployment revision if the current state of the Deployment is not stable. Each rollback updates the revision of the Deployment.
- Scale up the Deployment to facilitate more load.
- Pause the Deployment to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
- Use the status of the Deployment as an indicator that a rollout has stuck.
- Clean up older ReplicaSets that you don’t need anymore
Create
Pod-template-hash label: this label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the PodTemplate of the ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels, and in any existing Pods that the ReplicaSet might have.
Update
Deployment can ensure that only a certain number of Pods may be down while they are being updated. By default, it ensures that at least 1 less than the desired number of Pods are up (1 max unavailable).
rollout, rollout history/status, undo......
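The surge/unavailable knobs live under the Deployment's update strategy; a sketch with illustrative values:

```yaml
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1     # at most 1 pod below the desired count during the update
      maxSurge: 2           # at most 2 pods above the desired count during the update
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.9.1
```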
Scaling
Proportional scaling: with RollingUpdate (maxSurge, maxUnavailable), the number of pods may briefly exceed the desired count.
$ kubectl get deploy
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
nginx-deployment 10 10 10 10 50s
$ kubectl set image deploy/nginx-deployment nginx=nginx:sometag
deployment "nginx-deployment" image updated
$ kubectl get rs
NAME DESIRED CURRENT READY AGE
nginx-deployment-1989198191 5 5 0 9s
nginx-deployment-618515232 8 8 8 1m
Pausing and Resuming
Deployment status
Clean up Policy
You can set .spec.revisionHistoryLimit field in a Deployment to specify how many old ReplicaSets for this Deployment you want to retain
Note: canary deployments are not supported directly at present; the recommended approach is to implement them with multiple Deployments.
StatefulSets
Introduced in 1.5 to replace PetSets. A StatefulSet manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
Stateful here means:
- Stable, unique network identifiers.
- Stable, persistent storage.
- Ordered, graceful deployment and scaling. (Deployment rollouts are not ordered this strictly.)
- Ordered, graceful deletion and termination.
Components
components of a StatefulSet.例子
A Headless Service
(带selector), named nginx, is used to control the network domain.这种service不带lb,kube-proxy不处理,dns直接返回后端endpoint- The
StatefulSet
, named web, has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods. - The
volumeClaimTemplates
will provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1beta2
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: my-storage-class
      resources:
        requests:
          storage: 1Gi
Pod Identity
- Ordinal Index:each Pod in the StatefulSet will be assigned an integer ordinal, in the range [0,N), that is unique over the Set
- Stable Network ID:The pattern for the constructed hostname is $(statefulset name)-$(ordinal). The example above will create three Pods named web-0,web-1,web-2. A StatefulSet can use a Headless Service to control the domain of its Pods.
- Stable Storage: one PersistentVolume per VolumeClaimTemplate
| Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname |
|---|---|---|---|---|---|
| cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1} |
| cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1} |
| kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1} |
Deployment and Scaling Guarantees
- For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
- When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
- Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
- Before a Pod is terminated, all of its successors must be completely shutdown.
In Kubernetes 1.7 and later, StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and identity guarantees via its .spec.podManagementPolicy field.
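A minimal fragment of a StatefulSet spec relaxing the ordering guarantee:

```yaml
spec:
  podManagementPolicy: Parallel   # launch/terminate all Pods at once; default is OrderedReady
```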
Update Strategies
OnDelete; RollingUpdate; partitioned RollingUpdate
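A partition sketch (fields per the StatefulSet update-strategy API; the partition value 2 is arbitrary):

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # only Pods with ordinal >= 2 are updated; web-0 and web-1 stay on the old revision
```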
Daemon Sets
Runs one Pod on each node, acting as a daemon
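A minimal DaemonSet sketch (the name and image are illustrative; a log collector is the classic use case):

```yaml
apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v0.14   # one copy of this container runs on every node
```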
Alternatives to DaemonSet
- Init Scripts
- Static Pods: create Pods by writing a file to a certain directory watched by Kubelet.
Garbage Collection
Owners and dependents
When you delete an object, you can specify whether the object's dependents are also deleted automatically. Deleting dependents automatically is called cascading deletion. There are two modes of cascading deletion: background and foreground.
Foreground deletion: the root object first enters a "deletion in progress" state => the garbage collector deletes all of the object's dependents => then the owner object itself is deleted.
Background deletion: Kubernetes deletes the owner object immediately, and the garbage collector deletes the dependents in the background.
To cascade deletion down to Pods, Deployments must use propagationPolicy: Foreground.
Custom resources do not currently support garbage collection.
Setting the cascading deletion policy
To control the cascading deletion policy, set the deleteOptions.propagationPolicy field on your owner object. Possible values include “Orphan”, “Foreground”, or “Background”.
The default garbage collection policy for many controller resources is orphan, including ReplicationController, ReplicaSet, StatefulSet, DaemonSet, and Deployment.
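A sketch of setting the policy through the API (assumes kubectl proxy is listening on localhost:8080 and a Deployment named nginx-deployment exists in the default namespace):

```shell
# Foreground cascading deletion of a Deployment via deleteOptions.propagationPolicy
curl -X DELETE localhost:8080/apis/apps/v1beta1/namespaces/default/deployments/nginx-deployment \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
  -H "Content-Type: application/json"
```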
Jobs - Run to Completion
todo
Cron Jobs
todo
Configuration
Configuration Best Practices
This reads a bit like "Effective Kubernetes":
General Config Tips
- Version your configuration so you can roll back
- YAML is preferable to JSON
- Group related objects into a single YAML file where it makes sense
- Don’t specify default values unnecessarily – simple and minimal configs will reduce errors.
- Put an object description in an annotation to allow better introspection.
Services
- Create a Service before the ReplicationControllers/Deployments that back it; this lets the scheduler spread the pods that comprise the service.
- Don't use hostPort (use a NodePort Service instead) or hostNetwork unless it is absolutely necessary
- Use headless services for easy service discovery when you don’t need kube-proxy load balancing.
Using Labels
todo
Container Images
Skipped.
Using kubectl
- Use kubectl create -f where possible.
- Use kubectl run and expose to quickly create and expose single container Deployments.
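For example (the name my-nginx is illustrative):

```shell
# Create a two-replica Deployment and expose it as a Service on port 80
kubectl run my-nginx --image=nginx --replicas=2 --port=80
kubectl expose deployment my-nginx --port=80
```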
Managing Compute Resources for Containers
Resource requests and limits
todo
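A container-spec fragment showing the request/limit split (the values are illustrative):

```yaml
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 100m       # the scheduler reserves this much for placement decisions
        memory: 128Mi
      limits:
        cpu: 500m       # CPU usage above this is throttled
        memory: 256Mi   # memory usage above this gets the container OOM-killed
```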
Assigning Pods to Nodes
nodeSelector
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
Interlude: built-in node labels
kubernetes.io/hostname
failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region
beta.kubernetes.io/instance-type
beta.kubernetes.io/os
beta.kubernetes.io/arch
Affinity and anti-affinity
- nodeAffinity
  - requiredDuringSchedulingIgnoredDuringExecution
  - preferredDuringSchedulingIgnoredDuringExecution
- podAffinity
  - requiredDuringSchedulingIgnoredDuringExecution
  - preferredDuringSchedulingIgnoredDuringExecution
- podAntiAffinity
  - requiredDuringSchedulingIgnoredDuringExecution
  - preferredDuringSchedulingIgnoredDuringExecution
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: gcr.io/google_containers/pause:2.0
Taints and Tolerations
Node affinity is written on the Pod and describes which nodes the Pod wants. Taints are the opposite: they are set on a node to repel Pods that do not tolerate them.
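For example, tainting a node with `kubectl taint nodes node1 key=value:NoSchedule` keeps Pods off it unless they carry a matching toleration (key/value names here are placeholders):

```yaml
# Pod-spec fragment tolerating the taint above
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
```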
Secrets
Organizing Cluster Access Using kubeconfig Files
Pod Priority and Preemption
Cluster Administration
Managing Resources
Organizing resource configurations
- Multiple resources can be defined in a single YAML file
- kubectl create/delete -f takes a file, URL, or directory (--recursive) to create or delete resources
- kubectl get to retrieve resources
- kubectl label/annotate to attach metadata
- kubectl scale/autoscale/apply/edit/patch/replace to update resources
Cluster Networking
- Highly-coupled container-to-container communications: this is solved by pods and localhost communications.
- Pod-to-Pod communications: this is the primary focus of this document.
- Pod-to-Service communications: this is covered by services.
- External-to-Service communications: this is covered by services.
Kubernetes model
- all containers can communicate with all other containers without NAT
- all nodes can communicate with all containers (and vice-versa) without NAT
- the IP that a container sees itself as is the same IP that others see it as
Implementations: Contiv, Contrail, Flannel, GCE, L2 networks and Linux bridging, Nuage, OpenVSwitch, OVN, Calico, Romana, Weave Net
Network Plugins
- CNI plugins: adhere to the appc/CNI specification, designed for interoperability.
- Kubenet plugin: implements basic cbr0 using the bridge and host-local CNI plugins
Logging and Monitoring Cluster Activity
Auditing
Kubernetes audit is part of kube-apiserver logging all requests coming to the server.
Resource Usage Monitoring
Configuring Out Of Resource Handling
Eviction Policy
The kubelet can pro-actively monitor for and prevent against total starvation of a compute resource. In those cases, the kubelet can pro-actively fail one or more pods in order to reclaim the starved resource. When the kubelet fails a pod, it terminates all containers in the pod, and the PodPhase is transitioned to Failed.
Eviction Thresholds:
- A soft eviction threshold pairs an eviction threshold with a required, administrator-specified grace period.
- A hard eviction threshold has no grace period; if observed, the kubelet will take immediate action to reclaim the associated starved resource.
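These thresholds are set as kubelet flags; a sketch (the values are illustrative, not recommendations):

```shell
kubelet \
  --eviction-hard=memory.available<100Mi \
  --eviction-soft=memory.available<300Mi \
  --eviction-soft-grace-period=memory.available=30s
```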
Using Multiple Clusters
Federation
Federation makes it easy to manage multiple clusters. It does so by providing 2 major building blocks:
- Sync resources across clusters: Federation provides the ability to keep resources in multiple clusters in sync. This can be used, for example, to ensure that the same deployment exists in multiple clusters.
- Cross cluster discovery: It provides the ability to auto-configure DNS servers and load balancers with backends from all clusters. This can be used, for example, to ensure that a global VIP or DNS record can be used to access backends from multiple clusters.
Setting up Cluster Federation with Kubefed
Cross-cluster Service Discovery using Federated Services
Guaranteed Scheduling For Critical Add-On Pods
Rescheduler ensures that critical add-ons are always scheduled. If the scheduler determines that no node has enough free resources to run the critical add-on pod given the pods already running in the cluster, the rescheduler tries to free up space for the add-on by evicting some pods; then the scheduler will schedule the add-on pod.
A temporary taint "CriticalAddonsOnly" can be set so the node is used only to run critical add-on Pods, preventing other Pods from being scheduled onto it
Static Pods
Static pods are managed directly by the kubelet daemon on a specific node, without the API server observing them. They do not have any associated replication controller; the kubelet daemon itself watches them and restarts them when they crash. There is no health check though. Static pods are always bound to one kubelet daemon and always run on the same node with it.
The kubelet automatically creates a so-called mirror pod on the Kubernetes API server for each static pod, so the pods are visible there, but they cannot be controlled from the API server.
If you are running clustered Kubernetes and are using static pods to run a pod on every node, you should probably be using a DaemonSet!
Configured via --pod-manifest-path or --manifest-url
Using Sysctls in a Kubernetes Cluster
- In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system.
- A number of sysctls are namespaced in today’s Linux kernels. This means that they can be set independently for each pod on a node.
Safe sysctls: in addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node.
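A sketch of enabling a namespaced sysctl on a Pod; at the time of writing this was done through an alpha annotation, and the pod name, image, and sysctl chosen here are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
  annotations:
    security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
```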
Accessing Clusters
# Ways to access the REST API
# 1. Via kubectl proxy
kubectl proxy --port=8083 &
curl localhost:8083/api
# 2. Direct access
$ APISERVER=$(kubectl config view | grep server | cut -f 2- -d ":" | tr -d " ")
$ TOKEN=$(kubectl describe secret $(kubectl get secrets | grep default | cut -f1 -d ' ') | grep -E '^token' | cut -f2 -d':' | tr -d '\t')
$ curl $APISERVER/api --header "Authorization: Bearer $TOKEN" --insecure
several options for connecting to nodes, pods and services from outside the cluster:
- Access services through public IPs: Use a service with type NodePort or LoadBalancer to make the service reachable outside the cluster. See
- Access services, nodes, or pods using the Proxy Verb : Does apiserver authentication and authorization prior to accessing the remote service. Use this if the services are not secure enough to expose to the internet, or to gain access to ports on the node IP, or for debugging.
- Access from a node or pod in the cluster : Run a pod, and then connect to a shell in it using kubectl exec. Connect to other nodes, pods, and services from that shell.
# Discovering builtin services
kubectl cluster-info
Types of Kubernetes proxies
- The kubectl proxy:
  - runs on a user's desktop or in a pod
  - proxies from a localhost address to the Kubernetes apiserver
  - client to proxy uses HTTP
  - proxy to apiserver uses HTTPS
  - locates apiserver
  - adds authentication headers
- The apiserver proxy:
  - is a bastion built into the apiserver
  - connects a user outside of the cluster to cluster IPs which otherwise might not be reachable
  - runs in the apiserver processes
  - client to proxy uses HTTPS (or HTTP if apiserver so configured)
  - proxy to target may use HTTP or HTTPS as chosen by proxy using available information
  - can be used to reach a Node, Pod, or Service
  - does load balancing when used to reach a Service
- The kube proxy:
  - runs on each node
  - proxies UDP and TCP
  - does not understand HTTP
  - provides load balancing
  - is just used to reach services
- A Proxy/Load-balancer in front of apiserver(s):
  - existence and implementation varies from cluster to cluster (e.g. nginx)
  - sits between all clients and one or more apiservers
  - acts as load balancer if there are several apiservers
- Cloud Load Balancers on external services:
  - are provided by some cloud providers (e.g. AWS ELB, Google Cloud Load Balancer)
  - are created automatically when the Kubernetes service has type LoadBalancer
  - use UDP/TCP only
  - implementation varies by cloud provider
Authenticating Across Clusters with kubeconfig
Storage
Volumes
Persistent Volumes
Services, Load Balancing, and Networking
- Pods are mortal, and Pod IP addresses cannot be relied upon to be stable over time
- Hence Services
- A Service is (usually) determined by a Label Selector
- For Kubernetes-native applications, Kubernetes offers a simple Endpoints API that is updated whenever the set of Pods in a Service changes. For non-native applications, Kubernetes offers a virtual-IP-based bridge to Services which redirects to the backend Pods
- Creating a Service with a selector automatically creates an Endpoints object for the backends. You can also omit the selector and manually create an Endpoints object with the same name, or use type: ExternalName to forward traffic to an external service. Except for ExternalName, a Service's virtual IP is implemented by kube-proxy
- Ingress works at layer 7 and routes to Services, which work at layer 4
- port and nodePort are both Service ports; the former is the one exposed to clients inside the cluster
- Service load balancing has two modes: traffic goes through kube-proxy (userspace mode) or through iptables rules
- Environment variables such as {SVCNAME}_SERVICE_HOST and {SVCNAME}_SERVICE_PORT are injected into Pods
- Setting spec.clusterIP: None makes a Headless Service; its DNS name then resolves to all Endpoints
- ServiceType: ClusterIP (default), NodePort (opens the same port on every node, forwarding to the Service), LoadBalancer (relies on the IaaS and gets an EXTERNAL-IP), ExternalName
- Any type of Service can additionally be exposed on externalIPs
kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80 # port exposed by the service
    targetPort: 9376 # port on the Pod; defaults to the same value as port
kind: Service
apiVersion: v1
metadata:
  name: my-service
  namespace: prod
spec:
  type: ExternalName
  externalName: my.database.example.com
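A sketch of exposing a Service on an externalIP (the IP and ports are illustrative; traffic arriving at that IP on the service port is routed to the backends):

```yaml
kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  externalIPs:
  - 80.11.12.10   # traffic to 80.11.12.10:80 is delivered to this service
```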
DNS Pods and Services
- Supports my-svc.my-namespace.svc.cluster.local and pod-ip-address.my-namespace.pod.cluster.local
- Built-in services such as kubernetes get default DNS entries
Connecting Applications with Services
tutorial
Ingress Resources
tutorial
Network Policies
tutorial
kubectl exec -ti busybox -- nslookup kubernetes.default